 # On the Perceptron's Compression

We study and provide an exposition of several phenomena related to the perceptron's compression. One theme concerns modifications of the perceptron algorithm that yield better guarantees on the margin of the hyperplane it outputs. These modifications can be useful in training neural networks as well, and we demonstrate them with some experimental data. In a second theme, we deduce conclusions from the perceptron's compression in various contexts.


## 1 Introduction

The perceptron is an abstraction of a biological neuron that was introduced in the 1950s by Rosenblatt [31], and has been extensively studied in many works (see e.g. the survey ). It receives as input a list of real numbers (various electrical signals in the biological case) and if the weighted sum of its input is greater than some threshold it outputs $1$ and otherwise $-1$ (it fires or not in the biological case).

Formally, a perceptron computes a function of the form $\mathrm{sign}(w \cdot x - b)$ where $w \in \mathbb{R}^d$ is the weight vector, $b \in \mathbb{R}$ is the threshold, $\cdot$ is the standard inner product, and $\mathrm{sign} : \mathbb{R} \to \{\pm 1\}$ is $1$ on the non-negative numbers. It is only capable of representing binary functions that are induced by partitions of $\mathbb{R}^d$ by hyperplanes.

###### Definition.

A map $Y : X \to \{\pm 1\}$ over a finite set $X \subset \mathbb{R}^d$ is (linearly) separable if there exists $w \in \mathbb{R}^d$ such that $\mathrm{sign}(w \cdot x) = Y(x)$ for all $x \in X$. When the Euclidean norm of $w$ is $\|w\| = 1$, the number $\mathrm{marg}(w, Y) = \min_{x \in X} Y(x)\, w \cdot x$ is the margin of $w$ with respect to $Y$. The number $\mathrm{marg}(Y) = \sup_{w : \|w\| = 1} \mathrm{marg}(w, Y)$ is the margin of $Y$. We call $Y$ an $\varepsilon$-partition if its margin is at least $\varepsilon$.

Footnote 1 (on "linearly"): We focus on the linear case, when the threshold is $0$. A standard lifting that adds a coordinate with $1$ to every vector allows one to translate the general (affine) case to the linear case. This lifting may significantly decrease the margin: there are maps whose margin in the affine sense is much larger than the margin of their lift in the linear sense. This solution may therefore cause an unnecessary increase in running time. This tax can be avoided, for example, if one has prior knowledge of $R = \max_{x \in X} \|x\|$. In this case, setting the last coordinate to be $R$ does not significantly decrease the margin. In fact, it can be avoided without any prior knowledge using the ideas in Algorithm 3 below.

Variants of the perceptron (neurons) are the basic building blocks of general neural networks. Typically, the sign function is replaced by some other activation function (e.g., sigmoid or rectified linear unit). Therefore, studying the perceptron and its variants may help in understanding neural networks, their design and their training process.

### Overview

In this paper, we provide some insights into the perceptron’s behavior, survey some of the related work, deduce some geometric applications, and discuss their usefulness in other learning contexts. Below is a summary of our results and a discussion of related work, partitioned into five parts numbered (i) to (v). Each of the results we describe highlights a different aspect of the perceptron’s compression (the perceptron’s output is a sum of a small subset of the examples). For more details, definitions and references, see the relevant sections.

(i) Variants of the perceptron (Section 2). The well-known perceptron algorithm (see Algorithm 1 below) is guaranteed to find a separating hyperplane in the linearly separable case. However, there is no guarantee on the hyperplane’s margin compared to the optimal margin $\varepsilon^*$. This problem was already addressed in several works, as we now explain (see also references within). The authors of [9] and [23] defined a variant of the perceptron that yields a provable guarantee on the margin; see Algorithm 2 and Theorem 2.1 below. The authors of [10] defined the passive-aggressive perceptron algorithms that allow, e.g., dealing with noise, but provided no guarantee on the margin of the output. The authors of  defined a variant of the perceptron that yields provable margin under the assumption that a lower bound on the optimal margin is known. The author of [17] designed the ALMA algorithm and showed that it provides almost optimal margin under the assumption that the samples lie on the unit sphere. It is worth noting that normalizing the examples to be on the unit sphere may significantly alter the margin, and even change the optimal separating hyperplane. The author of  defined the minimal overlap algorithm which guarantees optimal margin but is not online since it knows the samples in advance. Finally, the authors of [36] analyzed gradient descent for a single neuron and showed convergence to the optimal separating hyperplane under certain assumptions (appropriate activation and loss functions).

We provide two new ideas that improve the learning process: one that adaptively changes the “scale” of the problem and by doing so improves the guarantee on the margin of the output (Algorithm 3), and one that yields almost optimal margin (Algorithm 4).

(ii) Applications for neural networks (Section 3). Our variants of the perceptron algorithm are simple to implement, and can therefore be easily applied in the training process of general neural networks. We validate their benefits by training a basic neural network on the MNIST dataset.

(iii) Convex separation (Section 4). We use the perceptron’s compression to prove a sparse separation lemma for convex bodies. This perspective also suggests a different proof of Novikoff’s theorem on the perceptron’s convergence [30]. In addition, we interpret this sparse separation lemma in the language of game theory as yielding sparse strategies in a related zero-sum game.

(iv) Generalization bounds (Section 5). An important aspect of a learning algorithm is its generalization capabilities; namely, its error on new examples that are independent of the training set (see the textbook [33] for background and definitions). We follow the theme of , and observe that even though the (original) perceptron algorithm does not yield an optimal hyperplane, it still generalizes.

(v) Robust concepts (Section 6). The robust concepts theme presented by Arriaga and Vempala [3] suggests focusing on well-separated data. We notice that the perceptron fits well into this framework; specifically, that its compression yields efficient dimension reductions. Similar dimension reductions were used in several previous works (e.g. [3, 5, 15, 16, 4, 21, 6]).

Summary. In parts (i)-(ii) we provide a couple of new ideas for improving the training process and explain their contribution in the context of previous work. In part (iii) we use the perceptron’s compression as a tool for proving geometric theorems. We are not aware of previous works that studied this connection. Parts (iv)-(v) are mostly about presenting ideas from previous works in the context of the perceptron’s compression. We think that parts (iv) and (v) help to understand the picture more fully.

## 2 Variants of the Perceptron

Deciding how to train a model from a list of input examples is a central consideration in any learning process. In the case of the perceptron algorithm the input examples are traversed while maintaining a hypothesis in a way that reduces the error on the current example:
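The loop described above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of Algorithm 1, not the authors' exact pseudocode: the function name `perceptron`, the `max_updates` safety cap, the rule of picking the first violated example, and the toy data are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard inner product."""
    return sum(a * b for a, b in zip(u, v))


def perceptron(xs: Sequence[Sequence[float]], ys: Sequence[int],
               max_updates: int = 100_000) -> Tuple[List[float], List[int]]:
    """Return (w, used): the hypothesis and the indices of the examples added to it."""
    w = [0.0] * len(xs[0])
    used: List[int] = []
    while len(used) < max_updates:
        # Look for an example that the current hypothesis does not classify correctly.
        bad: Optional[int] = next(
            (i for i, (x, y) in enumerate(zip(xs, ys)) if y * dot(w, x) <= 0), None)
        if bad is None:
            return w, used  # every example is on the correct side
        # Update: add the signed example to the hypothesis.
        w = [wi + ys[bad] * xi for wi, xi in zip(w, xs[bad])]
        used.append(bad)
    raise RuntimeError("no separator found within max_updates (sample may not be separable)")


# Hypothetical toy sample, separable by the first coordinate.
xs = [(1.0, 0.2), (0.8, -0.1), (-1.0, 0.1), (-0.7, -0.3)]
ys = [1, 1, -1, -1]
w, used = perceptron(xs, ys)
```

The list `used` records which examples were added to the hypothesis. The output is always a signed sum of those examples, which is the compression that the rest of the paper exploits.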

Clearly, the perceptron algorithm terminates whenever its input sample is linearly separable, in which case its output represents a separating hyperplane. Novikoff [30] analyzed the number of steps required for the perceptron to stop as a function of the margin of the input sample.

The standard analysis of the perceptron convergence properties uses the optimal separating hyperplane (later in Section 4 we present an alternative analysis that does not use it):

$$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d \,:\, \|w\| = 1} \mathrm{marg}(w, S),$$

where we think of the sample $S$ as the map from the set of examples $\{x_i\}$ to $\{\pm 1\}$ defined by $x_i \mapsto y_i$ (we assume that $S$ is consistent with a function, that is, it does not contain identical points with opposite labels). Novikoff’s analysis consists of the following two parts. Let $\varepsilon^* = \mathrm{marg}(w^*, S)$ and $R = \max_i \|x_i\|$.

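Part I: The projection grows linearly in time. In each iteration, the projection of $w^{(t)}$ on $w^*$ grows by at least $\varepsilon^*$, since $y_i x_i \cdot w^* \ge \varepsilon^*$. By induction, we get $w^{(t)} \cdot w^* \ge \varepsilon^* t$ for all $t \ge 0$.

Part II: The norm grows sub-linearly in time. In each iteration,

$$\|w^{(t)}\|^2 = \|w^{(t-1)}\|^2 + 2 y_i x_i \cdot w^{(t-1)} + \|x_i\|^2 \le \|w^{(t-1)}\|^2 + R^2$$

(the term $2 y_i x_i \cdot w^{(t-1)}$ is not positive by choice). So by induction $\|w^{(t)}\| \le R\sqrt{t}$ for all $t$.

Combining the two parts,

$$1 \ge \frac{w^{(t)} \cdot w^*}{\|w^{(t)}\| \, \|w^*\|} \ge \frac{\varepsilon^*}{R} \sqrt{t},$$

which implies that the number of iterations of the algorithm is at most $(R/\varepsilon^*)^2$.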

As discussed in Section 1, Algorithm 1 has several drawbacks. Here we describe some simple ideas that improve it. Below we describe three algorithms, each followed by a theorem that summarizes its main properties.

In the following, $X \subset \mathbb{R}^d$ is a finite set, $Y$ is a linear partition of $X$, $\varepsilon^*$ is the optimal margin, and $R = \max_{x \in X} \|x\|$ is the maximal norm of a point.

In the first variant, which already appeared in [9, 23], the suggestion is to replace the condition in the while loop of Algorithm 1 by the condition $y_i \, w^{(t)} \cdot x_i < \beta$ for some a priori chosen $\beta > 0$ that may change over time. As we will see, different choices of $\beta$ yield different guarantees.

###### Theorem 2.1 ([9, 23]).

The $\beta$-perceptron algorithm performs at most $(2\beta + R^2)/(\varepsilon^*)^2$ updates and achieves a margin of at least $\beta \varepsilon^* / (2\beta + R^2)$.

###### Proof.

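We only replaced the condition in the while loop by the condition $y_i \, w^{(t)} \cdot x_i < \beta$, for some $\beta > 0$. As before, by induction

$$\|w^{(t)}\|^2 = \|w^{(t-1)}\|^2 + 2 y_i x_i \cdot w^{(t-1)} + \|x_i\|^2 \le (2\beta + R^2)\, t$$

and, with $w^*$ as above,

$$1 \ge \frac{w^{(t)} \cdot w^*}{\|w^{(t)}\| \, \|w^*\|} \ge \frac{\varepsilon^*}{\sqrt{2\beta + R^2}} \sqrt{t}.$$

The number of iterations is thus at most $(2\beta + R^2)/(\varepsilon^*)^2$. In addition, by choice, when the algorithm stops we have for all $i$,

$$y_i \, w^{(t)} \cdot x_i \ge \beta.$$

So, since

$$\|w^{(t)}\| \le \sqrt{(2\beta + R^2)\, t} \le \frac{2\beta + R^2}{\varepsilon^*},$$

we get

$$\mathrm{marg}(w^{(t)}, S) \ge \frac{\beta \varepsilon^*}{2\beta + R^2}.$$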
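The guarantee just derived can be checked numerically on a toy sample. The following sketch is an illustration only, not the authors' Algorithm 2 verbatim: the names `beta_perceptron` and `margin`, the first-violated-example selection rule, the `max_updates` cap, and the data are assumptions. Since the optimal margin $\varepsilon^*$ is not computed, the margin of a hand-picked separator is used as a lower bound for it.

```python
import math
from typing import List, Optional, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def beta_perceptron(xs: Sequence[Sequence[float]], ys: Sequence[int], beta: float,
                    max_updates: int = 100_000) -> Tuple[List[float], int]:
    """Perceptron that keeps updating while some example has y * <w, x> < beta."""
    w = [0.0] * len(xs[0])
    updates = 0
    while updates < max_updates:
        bad: Optional[int] = next(
            (i for i, (x, y) in enumerate(zip(xs, ys)) if y * dot(w, x) < beta), None)
        if bad is None:
            return w, updates
        w = [wi + ys[bad] * xi for wi, xi in zip(w, xs[bad])]
        updates += 1
    raise RuntimeError("no separator found within max_updates")


def margin(w: Sequence[float], xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    """Margin of the normalized hyperplane w on the labelled sample."""
    return min(y * dot(w, x) for x, y in zip(xs, ys)) / math.sqrt(dot(w, w))


# Hypothetical toy sample, separable by the first coordinate.
xs = [(1.0, 0.2), (0.8, -0.1), (-1.0, 0.1), (-0.7, -0.3)]
ys = [1, 1, -1, -1]
R2 = max(dot(x, x) for x in xs)          # R^2, the maximal squared norm
beta = 1.0
w, updates = beta_perceptron(xs, ys, beta)

# Margin of the hand-picked separator (1, 0): a lower bound on the optimal margin.
eps_lower = margin([1.0, 0.0], xs, ys)
guaranteed = beta * eps_lower / (2 * beta + R2)
```

On this sample the output margin `margin(w, xs, ys)` is well above `guaranteed`, which is expected since the theorem only gives a worst-case bound.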

To remove the dependence on $R$ in the output’s margin above, we propose to rescale $\beta$ according to the observed examples.

###### Theorem 2.2.

The $R$-independent perceptron algorithm performs at most $O((R/\varepsilon^*)^2)$ updates and achieves a margin of at least $\varepsilon^*/3$.

###### Proof.

This version of the algorithm guarantees a margin of coupled with a running time comparable to the original algorithm without knowing . Indeed, to bound the running time, observe that before a change in occurs, there could be at most errors (as before for the relevant and ). The amount of changes in is at most , where . The overall running time is at most

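$$\sum_{k=1}^{\lceil \log(R/r) \rceil} \frac{2 \cdot 4\|x_{i_k}\|^2 + (2\|x_{i_k}\|)^2}{(\varepsilon^*)^2} \;\le\; 2 \cdot \sum_{k=1}^{\lceil \log(R/r) \rceil} \frac{3 \cdot 4^k r^2}{(\varepsilon^*)^2} \;\le\; 6 \cdot \frac{4}{3} \cdot \frac{4^{\lceil \log(R/r) \rceil} r^2}{(\varepsilon^*)^2} = O\big((R/\varepsilon^*)^2\big).$$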

Finally, if one would like to improve upon the guarantee of Theorem 2.2, we suggest changing $\beta$ with time. To run the algorithm, we should first decide how well we want to approximate the optimal margin. To do so, we need to choose the parameter $\alpha$; the closer $\alpha$ is to $2$, the better the approximation is (see Theorem 2.3).

###### Theorem 2.3.

If , the -perceptron algorithm performs at most updates and achieves a margin of at least .

###### Proof.

For simplicity, we assume here that . The idea is as follows. The analysis of the classical perceptron relies on the fact that in each step. On the other hand, in an “extremely aggressive” version of the perceptron that always updates, one can only obtain a trivial bound (as can be the sum of unit vectors in the same direction). The update rule in the version below is tailored so that a bound of for is maintained.

Here we use that for ,

$$\|w^{(t)}\|^2 \le \|w^{(t-1)}\|^2 + \big(t^\alpha - (t-1)^\alpha - 1\big) + \|x_i\|^2.$$

By induction, for all ,

$$\|w^{(t)}\|^2 \le t^\alpha.$$

This time

$$1 \ge \frac{w^{(t)} \cdot w^*}{\|w^{(t)}\| \, \|w^*\|} \ge \frac{\varepsilon^* t}{t^{\alpha/2}}.$$

So, the running time is at most $(1/\varepsilon^*)^{2/(2-\alpha)}$.

The output’s margin is at least

$$\frac{0.5\big((t+1)^\alpha - t^\alpha - 1\big)}{t^{\alpha/2}}. \tag{1}$$

This is a decreasing function of $t$ for $t \ge 1$, since its derivative is at most zero (see Appendix A).

Since for , the output’s margin is at least

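$$\frac{0.5\alpha (1/\varepsilon^*)^{2(\alpha-1)/(2-\alpha)} - 1}{(1/\varepsilon^*)^{\alpha/(2-\alpha)}} = 0.5\alpha\varepsilon^* - (\varepsilon^*)^{\alpha/(2-\alpha)}.$$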

So we can get arbitrarily close to the true margin by setting for some small of our choice. This gives margin

$$(1-\delta)\varepsilon^* - (\varepsilon^*)^{(2-\delta)/\delta} \ge \varepsilon^*\big(1 - \delta - (\varepsilon^*)^{1/\delta}\big).$$

The running time, however, becomes .

When is very close to , the lower bound on the margin above may not be meaningful. We claim that the margin of the output is still close to even in this case. To see this, let be a hyperplane with margin . We can carry the argument above with instead of , and get that the margin is at least

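$$\tilde{\varepsilon}\big(1 - \delta - (\tilde{\varepsilon})^{1/\delta}\big) > \big(1 - 2\delta - \delta \ln(1/\delta)\big)\varepsilon^*.$$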

So we can choose small enough, without knowing any information on , and get an almost optimal margin.
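To make the time-dependent threshold concrete, here is a hedged Python sketch reconstructed from the inequalities in the proof above and from expression (1). It is not the authors' Algorithm 4 verbatim. The assumptions are: all examples have norm at most $1$, $1 < \alpha < 2$, the threshold after $t$ updates is $0.5((t+1)^\alpha - t^\alpha - 1)$, and the comparison is non-strict so that the first update can happen from $w = 0$. The names `growing_threshold_perceptron` and `margin` and the data are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def margin(w: Sequence[float], xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    return min(y * dot(w, x) for x, y in zip(xs, ys)) / math.sqrt(dot(w, w))


def growing_threshold_perceptron(xs: Sequence[Sequence[float]], ys: Sequence[int],
                                 alpha: float,
                                 max_updates: int = 100_000) -> Tuple[List[float], int]:
    """Perceptron whose update threshold grows with the number of updates t.

    Assumes every example has norm at most 1 and 1 < alpha < 2, so that the
    invariant ||w||^2 <= t^alpha is maintained after t updates.
    """
    w = [0.0] * len(xs[0])
    t = 0
    while t < max_updates:
        threshold = 0.5 * ((t + 1) ** alpha - t ** alpha - 1.0)
        bad: Optional[int] = next(
            (i for i, (x, y) in enumerate(zip(xs, ys)) if y * dot(w, x) <= threshold), None)
        if bad is None:
            return w, t
        w = [wi + ys[bad] * xi for wi, xi in zip(w, xs[bad])]
        t += 1
    raise RuntimeError("no separator found within max_updates")


# Hypothetical toy sample inside the unit ball, separable by the first coordinate.
xs = [(0.9, 0.2), (0.8, -0.1), (-0.9, 0.1), (-0.7, -0.3)]
ys = [1, 1, -1, -1]
alpha = 1.5
w, updates = growing_threshold_perceptron(xs, ys, alpha)

# Value of expression (1) at the stopping time: a lower bound on the output margin.
final_threshold = 0.5 * ((updates + 1) ** alpha - updates ** alpha - 1.0)
lower_bound = final_threshold / updates ** (alpha / 2)
```

The separator $(1, 0)$ has margin $0.7$ on this sample, so the running-time bound $(1/\varepsilon^*)^{2/(2-\alpha)}$ is at most $(1/0.7)^4 \approx 4.2$ updates here.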

###### Remark.

The bound on the running time is sharp, as the following example shows. Let . These two points are linearly separated with margin . The algorithm stops after iterations (if is small enough and close enough to ).

###### Remark.

Algorithms 3 and 4 can be naturally combined to a single algorithm that arrives arbitrarily close to the optimal margin without assuming that .

## 3 Application for Neural Networks

Our results explain some choices that are made in practice, and can potentially help to improve them. Observe that if one applies gradient descent on a neuron of the form $w \cdot x$ with loss function of the form $\max(0, \beta - y \, w \cdot x)$ with $\beta = 0$ then one gets the same update rule as in the perceptron algorithm. Choosing $\beta = 1$ corresponds to using the hinge loss to drive the learning process. The fact that $\beta > 0$ yields provable bounds on the output’s margin of a single neuron provides formal evidence that supports the benefits of the hinge loss.
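The correspondence between gradient descent and the perceptron update can be illustrated with a small sketch. It assumes a linear neuron $w \cdot x$, the loss $\max(0, \beta - y\, w \cdot x)$ and a unit learning rate. These specific forms, and the names `beta_hinge_loss` and `subgradient_step`, are assumptions made for the illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def beta_hinge_loss(w: Sequence[float], x: Sequence[float], y: int, beta: float) -> float:
    """Loss max(0, beta - y * <w, x>); beta = 1 is the usual hinge loss."""
    return max(0.0, beta - y * dot(w, x))


def subgradient_step(w: Sequence[float], x: Sequence[float], y: int, beta: float,
                     lr: float = 1.0) -> List[float]:
    """One (sub)gradient descent step on the loss above.

    Where the loss is positive its gradient in w is -y * x, so the step adds
    lr * y * x to w: exactly the perceptron-style update. Where the loss is
    zero we take the zero subgradient and w is unchanged.
    """
    if beta_hinge_loss(w, x, y, beta) > 0:
        return [wi + lr * y * xi for wi, xi in zip(w, x)]
    return list(w)


w0 = [0.0, 0.0]
x, y = [1.0, 2.0], -1
w1 = subgradient_step(w0, x, y, beta=1.0)   # loss is 1 > 0, so w moves by y * x
w2 = subgradient_step(w1, x, y, beta=1.0)   # now y * <w, x> = 5 >= beta, no update
```

With $\beta = 0$ the loss is positive exactly when the example is misclassified. With $\beta > 0$ the step also fires on examples that are classified correctly but with confidence below $\beta$.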

Moreover, in practice, $\beta$ is treated as a hyper-parameter and tuning it is a common challenge that needs to be addressed in order to maximize performance. We proposed a couple of new options for choosing and updating $\beta$ throughout the training process that may contribute towards a more systematic approach for setting $\beta$ (see Algorithms 3 and 4). Theorems 2.2 and 2.3 explain the theoretical advantages of these options in the case of a single neuron.

We also provide some experimental data. Our experiments verify that our suggestions for choosing $\beta$ can indeed yield better results. We used the MNIST database [24] of handwritten digits as a test case with no preprocessing. We used a simple and standard neural network with one hidden layer consisting of 800/300 neurons and 10 output neurons (the choice of 800 and 300 is the same as in Simard et al. [35] and LeCun et al.). We trained the network by back-propagation (gradient descent). Each output neuron acts on the output of the hidden layer, and its loss function depends on $\beta$; we tested different $\beta$’s. This loss function is $0$ if the neuron provides a correct and confident (depending on $\beta$) classification of its input and is linear otherwise. This choice updates the network even when the network classifies correctly but with less than $\beta$ confidence. It has the added value of yielding simple and efficient calculations compared to other choices (like cross entropy or soft-max).

Footnote 3: An additional added value is that with this loss function there is a dichotomy: either an error occurred or not. This dichotomy can be helpful in making decisions throughout the learning process. For example, instead of choosing the batch-size to be of fixed size, we can choose the batch-size in a dynamic but simple way: just wait until a prescribed number of errors occurred.

We tested four values of $\beta$ as shown in Figure 1. In two tests, the value of $\beta$ is fixed in time (time is measured by the number of updates) at the values shown in Figure 1. In two tests, $\beta$ changes with the time in a sub-linear fashion. This choice can be better understood after reading the analysis of Algorithm 4. Roughly speaking, the analysis predicts that $\beta$ should be of the form for , and that the smaller is, the smaller the error will be. This prediction is indeed verified in the experiments; it is evident that choosing $\beta$ in a time-dependent manner yields better results. For comparison, the last row of the table shows the error of the two-layer MLP of the same size that is driven by the cross-entropy loss . In fact, our network of 300 neurons performed better than all the general purpose networks with 300 neurons, even those with preprocessing of the data, that appear in http://yann.lecun.com/exdb/mnist/.

Finally, a natural suggestion that emerges from our work is to add $\beta$ as a parameter for each individual neuron in the network, and not just to the loss function. Namely, to translate the input to a neuron by $\beta$. The value of $\beta$ may change during the learning process. Figuratively, this can be thought of as “internal clocks” of the neurons.

## 4 Convex Separation

Linear programming (LP) is a central paradigm in computer science and mathematics. LP duality is a key ingredient in many algorithms and proofs, and is deeply related to von Neumann’s minimax theorem that is seminal in game theory [29]. Two related and fundamental geometric properties are Farkas’ lemma [12], and the following separation theorem.

###### Theorem 4.1 (Convex separation theorem).

For every non empty convex sets , precisely one of the following holds: (i) , or (ii) there is a hyperplane separating and .

We observe that the following stronger version of the separation theorem follows from the perceptron’s compression (a similar version of Farkas’ lemma can be deduced as well).

###### Lemma 4.2 (Sparse Separation).

For every non empty convex sets so that and every , one of the following holds:

1. .

2. There is a hyperplane separating from so that its normal vector is “sparse”: for all , for all , and is a sum of at most points in and .

###### Proof.

Let be convex sets and . For , let in be the same as in the first coordinates and in the last (we have ). We thus get two convex bodies and in dimensions (using the map ).

Run Algorithm 2 with on inputs that positively label and negatively label . This produces a sequence of vectors so that for all . For every , the vector is of the form where is a sum of elements of and is a sum of elements of so that . In particular, we can write for where and (note that the last coordinate of equals ).

If the algorithm does not terminate after steps for satisfying then it follows that . In particular, and so

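$$\frac{\varepsilon}{4} > \big\|\alpha^{(t)} p^{(t)} - (1 - \alpha^{(t)}) q^{(t)}\big\| > \frac{\|p^{(t)} - q^{(t)}\|}{2} - \frac{\varepsilon}{4},$$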

which implies that .

In the complementing case, the algorithm stops after rounds. Let be the first coordinates of and be its last coordinate. For all ,

$$\frac{w \cdot p + b}{\|w\|} \ge \frac{1}{\|w^{(T)}\|} \ge \frac{1}{\sqrt{6T}} > \frac{\varepsilon}{30}.$$

Similarly, for all we get . ∎

The lemma is strictly stronger than the preceding separation theorem. Below, we also explain how this perspective yields an alternative proof of Novikoff’s theorem on the convergence of the perceptron [30]. It is interesting to note that the usual proof of the separation theorem relies on a concrete construction of the separating hyperplane that is geometrically similar to hard-SVMs. The proof using the perceptron, however, does not include any “geometric construction” and yields a sparse and strong separator (it also holds in infinite dimensional Hilbert space, but it uses that the sets are bounded in norm).

### Alternative Proof of the Perceptron’s Convergence

Assume without loss of generality that all of examples are labelled positively (by replacing by if necessary). Also assume that . As in the proof above, let be the sequence of vectors generated by the perceptron (Algorithm 1). Instead of arguing that the projection on grows linearly with , argue as follows. The vectors defined by are in the convex hull of the examples and have norm at most . Specifically, for every of norm we have and so there is an example so that . This implies that the running time satisfies since for every example we have .

### A Game Theoretic Perspective

The perspective of game theory turned out to be useful in several works in learning theory (e.g. [13, 28]). The ideas above have a game theoretic interpretation as well. In the associated game there are two players. A Point player whose pure strategies are points in some finite set so that , and a Hyperplane player whose pure strategies are for with . For a given choice of and , the Hyperplane player’s payoff is of coins (if this number is negative, then the Hyperplane player pays the Point player). The goal of the Point player is thus to minimize the amount of coins she pays. A mixed strategy of the Point player is a distribution on , and of the Hyperplane player is a (finitely supported) distribution on . The expected gain is

$$P(\mu, \kappa) = \mathop{\mathbb{E}}_{(v, w) \sim \mu \times \kappa} P(v, w).$$

###### Claim 4.3 (Sparse Strategies).

Let be the minimax value of the game:

$$\varepsilon^* = \sup_{\kappa} \inf_{\mu} P(\mu, \kappa) \ge 0.$$

There is (if then ) and a sequence of mixed strategies of the Point player so that for all , the support size of is at most and for every mixed strategy of the Hyperplane player,

$$P(\mu_t, \kappa) \le \sqrt{3/t}.$$

###### Proof.

Let $v^{(t)}$ be as in the proof of Lemma 4.2 above, when we replace by and by . We can interpret $v^{(t)}$ as a mixed strategy $\mu_t$ of the Point player (the uniform distribution over some multi-subset of the Point player’s pure strategies of size $t$). Specifically, for every and ,

$$P(\mu_t, \kappa) = \mathop{\mathbb{E}}_{w \sim \kappa} \, v^{(t)} \cdot w \le \|v^{(t)}\| \le \sqrt{3/t}.$$

Denote by the stopping time. If then indeed tends to zero as . If , we have for all . We can interpret as a non trivial strategy for the Hyperplane player: let

$$\tilde{w} = \frac{v^{(T)}}{\|v^{(T)}\|}.$$

Thus, for every ,

$$P(\mu, \tilde{w}) \ge \frac{1}{T \|v^{(T)}\|} \ge \frac{1}{\sqrt{3T}}.$$

In particular, and so

$$T \ge \frac{1}{3(\varepsilon^*)^2}.$$

The last strategy in the sequence guarantees the Point player a loss of at most . This sequence is naturally and efficiently generated by the perceptron algorithm and produces a strategy for the Point player that is optimal up to a constant factor. The ideas presented in Section 2 allow to reduce the constant to as close to as we want, by paying in running time (see Algorithm 4).
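The way the perceptron generates sparse strategies for the Point player can be sketched as follows. The sketch rests on several assumptions that are not spelled out above: the payoff is taken to be the inner product $v \cdot w$, the points lie in the unit ball, all points are labelled positively, and the update condition is that of the $\beta$-perceptron with $\beta = 1$, so that $\|w^{(t)}\|^2 \le 3t$. The function name `point_player_strategies` and the point set are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def point_player_strategies(points: Sequence[Sequence[float]],
                            steps: int) -> List[List[float]]:
    """Return the averaged vectors v^(t) = w^(t) / t for t = 1, 2, ...

    Each v^(t) is the mean of the uniform distribution over the multiset of
    points used in the first t updates, i.e. a mixed strategy with support
    size at most t.
    """
    w = [0.0] * len(points[0])
    history: List[List[float]] = []
    for t in range(1, steps + 1):
        bad: Optional[Sequence[float]] = next((p for p in points if dot(w, p) < 1.0), None)
        if bad is None:
            break  # w separates the points: the Hyperplane player has a good reply
        w = [wi + pi for wi, pi in zip(w, bad)]
        history.append([wi / t for wi in w])
    return history


# Hypothetical point set with the origin in its convex hull (value of the game is 0).
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
history = point_player_strategies(points, steps=50)
norms = [math.sqrt(dot(v, v)) for v in history]
bounds = [math.sqrt(3.0 / t) for t in range(1, len(history) + 1)]
```

Against any unit vector $w$ of the Hyperplane player, the expected payoff of the strategy at time $t$ is at most `norms[t - 1]`, which stays below `bounds[t - 1]` $= \sqrt{3/t}$.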

## 5 Generalization Bounds

Generalization is one of the key concepts in learning theory. One typically formalizes it by assuming that the input sample consists of i.i.d. examples drawn from an unknown distribution on that are labelled by some unknown function . The algorithm is said to generalize if it outputs an hypothesis so that is as small as possible.

We focus on the linearly separable case. A natural choice for the output hypothesis in this case is given by hard-SVM; namely, the halfspace with maximum margin on the input sample. It is known that if $D$ is supported on points that are $\varepsilon$-far from some hyperplane then the hard-SVM choice generalizes well (see Theorem 15.4 in [33]). The proof of this property of hard-SVMs uses Rademacher complexity.

We suggest that using the perceptron algorithm, instead of the hard-SVM solution, yields a more general statement with a simpler proof. The reason is that the perceptron can be interpreted as a sample compression scheme.

###### Theorem 5.1 (similar to ).

Let be a distribution on . Let . Let be i.i.d. samples from . Let . If

$$\Pr_S\big[\mathrm{marg}(S) < \varepsilon\big] < \delta/2 \tag{2}$$

for some , then

$$\Pr_S\left[\Pr_D\big[\pi(S) \ne c\big] \le 50 \cdot \frac{\log(\varepsilon^2 m) + \log(2/\delta)}{\varepsilon^2 m}\right] \ge 1 - \delta$$

where is the perceptron algorithm.

The theorem can also be interpreted as a local-to-global statement in the following sense. Assume that we know nothing of $D$, but we get a list of samples that are linearly separable with significant margin (this is a local condition that we can empirically verify). Then we can deduce that $D$ is close to being linearly separable. The perceptron’s compression allows one to deduce more general local-to-global statements, like bounding the global margin via the local/empirical margins (this is related to ).

Condition (2) holds when the expected value of one over the margin is bounded from above (and may hold when is not linearly separable). This assumption is weaker than the assumption in  on the behavior of hard-SVMs (that the margin is always bounded from below).

For the proof of Theorem 5.1 we will need the following.

###### Definition (Selection schemes).

A selection scheme of size consists of a compression map and a reconstruction map such that for every input sample :

• maps to a sub-sample of of size at most .

• maps to a hypothesis ; this is the output of the learning algorithm induced by the selection scheme.

Following Littlestone and Warmuth [25], David et al. [11] showed that every selection scheme does not overfit its data: Let $K$ be a selection scheme of size $d$. Let $S$ be a sample of $m$ independent examples from an arbitrary distribution $D$ that are labelled by some fixed concept $c$, and let $K(S)$ be the output of the selection scheme. For a hypothesis $h$, let $L_D(h)$ denote the true error of $h$ and $L_S(h)$ denote the empirical error of $h$.

###### Theorem 5.2 ([11]).

For every ,

$$\Pr_S\left[\big|L_D(K(S)) - L_S(K(S))\big| \ge \sqrt{\varepsilon \cdot L_S(K(S))} + \varepsilon\right] \le \delta,$$

where

$$\varepsilon = 50 \cdot \frac{d \log(m/d) + \log(1/\delta)}{m}.$$

###### Proof of Theorem 5.1.

Consider the following selection scheme of size $1/\varepsilon^2$ that agrees with the perceptron on samples with margin at least $\varepsilon$: If the input sample $S$ has $\mathrm{marg}(S) \ge \varepsilon$, apply the perceptron (which gives a compression of size $1/\varepsilon^2$). Else, compress it to the empty set and reconstruct it to some dummy hypothesis. The theorem now follows by applying Theorem 5.2 on this selection scheme and by the assumption that $\mathrm{marg}(S) \ge \varepsilon$ for $1 - \delta/2$ of the space (note that $L_S(K(S)) = 0$ when $\mathrm{marg}(S) \ge \varepsilon$). ∎
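The view of the perceptron as a selection scheme can be sketched in Python. This is a simplified illustration under stated assumptions: the compression map keeps the multiset of examples on which an update was made, the reconstruction map sums them with their signs, and the margin test and the dummy hypothesis from the proof are omitted. The names `compress` and `reconstruct`, the `max_updates` cap, and the data are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[Tuple[float, ...], int]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def compress(sample: Sequence[Example], max_updates: int = 100_000) -> List[Example]:
    """Compression map: run the perceptron and keep only the examples it updated on."""
    w = [0.0] * len(sample[0][0])
    kept: List[Example] = []
    while len(kept) < max_updates:
        bad: Optional[Example] = next(
            ((x, y) for x, y in sample if y * dot(w, x) <= 0), None)
        if bad is None:
            return kept
        x, y = bad
        w = [wi + y * xi for wi, xi in zip(w, x)]
        kept.append(bad)
    raise RuntimeError("sample does not look linearly separable")


def reconstruct(sub_sample: Sequence[Example]) -> Callable[[Sequence[float]], int]:
    """Reconstruction map: rebuild the hypothesis from the kept examples alone."""
    dim = len(sub_sample[0][0]) if sub_sample else 0
    w = [0.0] * dim
    for x, y in sub_sample:
        w = [wi + y * xi for wi, xi in zip(w, x)]
    # With an empty sub-sample this is the constant hypothesis +1.
    return lambda x: 1 if dot(w, x) >= 0 else -1


# Hypothetical toy sample, separable by the first coordinate.
sample: List[Example] = [((1.0, 0.2), 1), ((0.8, -0.1), 1),
                         ((-1.0, 0.1), -1), ((-0.7, -0.3), -1)]
sub_sample = compress(sample)
hypothesis = reconstruct(sub_sample)
```

The hypothesis depends on the sample only through `sub_sample`, whose size is bounded by the number of perceptron updates. This is the property that Theorem 5.2 turns into a generalization bound.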

## 6 Robust Concepts

Here we follow the theme of robust concepts presented by Arriaga and Vempala [3]. Let $X \subset \mathbb{R}^d$ be of size $N$ so that . Think of $X$ as representing a collection of $N$ high resolution images. As in many learning scenarios, some assumptions on the learning problem should be made in order to make it accessible. A typical assumption is that the unknown function to be learnt belongs to some specific class of functions. Here we focus on the class of all $\varepsilon$-separated partitions of $X$; these are functions that are linearly separable with margin at least $\varepsilon$. Such partitions are called robust concepts in [3] and correspond to “easy” classification problems.

Arriaga and Vempala demonstrated the difference between robust concepts and non-robust concepts with the following analogy: it is much easier to distinguish between “Elephant” and “Dog” than between “African Elephant” and “Indian Elephant.” They proved that random projections can help to perform efficient dimension reduction for $\varepsilon$-separated learning problems (and more general examples). They also described “neuronal” devices for performing it, and discussed their advantages. Similar dimension reductions were used in several other works in learning, e.g. [15, 16, 4, 21, 6].

We observe that the perceptron’s compression allows one to deduce a simultaneous dimension reduction. Namely, the dimension reduction works simultaneously for the entire class of robust concepts. This follows from results in Ben-David et al. [5], who studied limitations of embedding learning problems in linearly separated classes.

We now explain this in more detail. The first step in the proof is the following theorem.

###### Theorem 6.1 ().

The number of -separated partitions of is at most .

###### Proof.

Given an -partition of the set , the perceptron algorithm finds a separating hyperplane after making at most updates. It follows that every -partition can be represented by a multiset of together with the corresponding signs. The total number of options is at most . ∎

The theorem is sharp in the following sense.

###### Example 6.2.

Let be the standard unit vectors. Every subset of the form for of size is -separated, and there are such subsets.

The example also allows to lower bound the number of updates of any perceptron-like algorithm. If there is an algorithm that given of margin is able to find so that for that can be described by at most of the points in then should be at least .

The upper bound in the theorem allows one to perform dimension reduction that simultaneously works well on the entire concept class. Let $A$ be a $k \times d$ matrix with i.i.d. entries that are normally distributed (other distributions will work just as well) with where is an absolute constant. Given $X$, we can consider

$$AX = \{Ax : x \in X\} \subset \mathbb{R}^k$$

in a potentially smaller dimension space. The map is almost surely one-to-one on . So, every subset of corresponds to a subset of and vice versa. The following theorem shows that it preserves all well-separated partitions.

###### Theorem 6.3 (implicit in ).

With probability of at least over the choice of $A$, all -partitions of $X$ are -partitions of $AX$ and all -partitions of $AX$ are -partitions of $X$.

The proof of the above theorem is a simple application of Theorem 6.1 together with the Johnson-Lindenstrauss lemma.

###### Lemma 6.4 ().

Let with for all . Then, for every and ,

$$\Pr\Big[\exists\, i, j \in [N] : \big|(Ax_i \cdot Ax_j) - (x_i \cdot x_j)\big| > \varepsilon\Big] < \delta,$$

where and is a matrix with i.i.d. entries that are .
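The inner-product preservation in the lemma can be observed empirically with a small sketch. It assumes Gaussian entries with mean $0$ and variance $1/k$, a common normalization for which inner products are preserved in expectation. The dimensions, the seed, the random unit vectors and the helper names (`gaussian_matrix`, `project`, `random_unit_vector`) are hypothetical choices for the illustration.

```python
import math
import random
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def gaussian_matrix(k: int, d: int, rng: random.Random) -> List[List[float]]:
    """k x d matrix with i.i.d. N(0, 1/k) entries (assumed normalization)."""
    sigma = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]


def project(A: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Compute the matrix-vector product A x."""
    return [dot(row, x) for row in A]


def random_unit_vector(d: int, rng: random.Random) -> List[float]:
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


rng = random.Random(0)
d, k, n_points = 20, 2000, 5
points = [random_unit_vector(d, rng) for _ in range(n_points)]
A = gaussian_matrix(k, d, rng)
images = [project(A, p) for p in points]

# Largest distortion of an inner product over all pairs (including i = j).
worst = max(abs(dot(images[i], images[j]) - dot(points[i], points[j]))
            for i in range(n_points) for j in range(n_points))
```

Here $k$ is larger than $d$ only to make the concentration easy to see on a tiny example. In the intended use $k$ depends on $N$, $\varepsilon$ and $\delta$ but not on $d$, so the map reduces the dimension when $d$ is large.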

## References

1. A. Andoni, R. Panigrahy, G. Valiant and L. Zhang. Learning Polynomials with Neural Networks. PMLR 32(2), pages 1908–1916, 2014.
2. J. K. Anlauf and M. Biehl. The AdaTron: An Adaptive Perceptron Algorithm. EPL, 1989.
3. R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63(2), pages 161–182, 2006.
4. N. Balcan, A. Blum and S. Vempala. On Kernels, Margins and Low-dimensional mappings. In ALT, 2004.
5. S. Ben-David, N. Eiron and H. U. Simon. Limitations of Learning Via Embeddings in Euclidean Half Spaces. In JMLR, 2002.
6. A. Blum and R. Kannan. Learning an intersection of k halfspaces over a uniform distribution. In FOCS, 1993.
7. B. E. Boser, I. M. Guyon and V. N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.
8. N. Cesa-Bianchi, A. Conconi and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), pages 2050–2057, 2004.
9. R. Collobert and S. Bengio. Links between perceptrons, MLPs and SVMs. IDIAP, 2004.
10. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research 7, pages 551–585, 2006.
11. O. David, S. Moran and A. Yehudayoff. Supervised learning through the lens of compression. In NIPS, pages 2784–2792, 2016.
12. G. Farkas. Über die Theorie der Einfachen Ungleichungen. Journal für die Reine und Angewandte Mathematik, 124, pages 1–27, 1902.
13. Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation 121(2), pages 256–285, 1995.
14. Y. Freund and R. E. Schapire. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, pages 277–296, 1999.
15. A. Garg, S. Har-Peled and D. Roth. On generalization bounds, projection profile, and margin distribution. In ICML, pages 171–178, 2002.
16. A. Garg and D. Roth. Margin Distribution and Learning. In ICML, pages 210–217, 2003.
17. C. Gentile. A New Approximate Maximal Margin Classification Algorithm. Journal of Machine Learning Research, pages 213–242, 2001.
18. T. Graepel, R. Herbrich and J. Shawe-Taylor. PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification. Machine Learning, pages 55–76, 2005.
19. W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Conference in modern analysis and probability, 1982.
20. R. Khardon and G. Wachman. Noise Tolerant Variants of the Perceptron Algorithm. Journal of Machine Learning Research, pages 227–248, 2007.
21. A. Klivans and R. Servedio. Learning intersections of halfspaces with a margin. In Workshop on Computational Learning Theory, 2004.
22. M. Korzen and K. Klesk. Maximal Margin Estimation with Perceptron-Like Algorithm. In ICAISC, 2008.
23. W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. J. Phys. A: Math. Gen., 1987.
24. Y. LeCun and C. Cortes. The MNIST database of handwritten digits. 1998.
25. N. Littlestone and M. Warmuth. Relating data compression and learnability. Unpublished, 1986.
26. J. Matousek. On variants of the Johnson–Lindenstrauss lemma. Random Structures & Algorithms, 33(2), pages 142–156, 2008.
27. M. Mohri and A. Rostamizadeh. Perceptron Mistake Bounds. arXiv:1305.0208.
28. S. Moran and A. Yehudayoff. Sample compression schemes for VC classes. JACM 63(3), pages 1–21, 2016.
29. J. von Neumann. Zur Theorie der Gesellschaftsspiele. Math. Ann. 100, pages 295–320, 1928.
30. A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.
31. F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), pages 386–408, 1958.
32. R. E. Schapire, Y. Freund, P. Bartlett and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), pages 1651–1686, 1998.
33. S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
34. S. Shalev-Shwartz, Y. Singer, N. Srebro and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming 127(1), pages 3–30, 2011.
35. P. Y. Simard, D. Steinkraus and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR 3, pages 958–962, 2003.
36. D. Soudry, E. Hoffer and N. Srebro. The Implicit Bias of Gradient Descent on Separable Data. arXiv:1710.10345, 2017.
37. A. Wendemuth. Learning the unlearnable. J. Phys. A: Math. Gen., 1995.

## Appendix A The derivative of the margin

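Here we prove that the derivative of (1) is at most zero. The numerator of the derivative is $0.5$ times

$$\begin{aligned}
&\big(\alpha(t+1)^{\alpha-1} - \alpha t^{\alpha-1}\big) t^{\alpha/2} - \frac{\alpha}{2} t^{(\alpha-2)/2} \big((t+1)^\alpha - t^\alpha - 1\big) \\
&\quad = \frac{\alpha}{2} t^{(\alpha-2)/2} \big(2t(t+1)^{\alpha-1} - 2t^\alpha\big) + \frac{\alpha}{2} t^{(\alpha-2)/2} \big(-(t+1)^\alpha + t^\alpha + 1\big) \\
&\quad = \frac{\alpha t^{(\alpha-2)/2}}{2} \big((t+1)^{\alpha-1}(t-1) - t^\alpha + 1\big).
\end{aligned}$$

At $t = 1$, we get the value $0$, so it suffices to prove that $(t+1)^{\alpha-1}(t-1) - t^\alpha + 1$ is a non-increasing function for $t \ge 1$. Indeed, the derivative of the term inside the parentheses is

$$\begin{aligned}
&(\alpha-1)(t+1)^{\alpha-2}(t-1) + (t+1)^{\alpha-1} - \alpha t^{\alpha-1} \\
&\quad = (\alpha-1)\left(\frac{t-1}{(t+1)^{2-\alpha}} - t^{\alpha-1}\right) + (t+1)^{\alpha-1} - t^{\alpha-1} \\
&\quad \le (\alpha-1)\left(\frac{t-1}{(t+1)^{2-\alpha}} - t^{\alpha-1}\right) + (\alpha-1) t^{\alpha-2} && (\alpha < 2) \\
&\quad \le (\alpha-1)\left(\frac{t-1}{t^{2-\alpha}} - t^{\alpha-1} + \frac{1}{t^{2-\alpha}}\right) = 0.
\end{aligned}$$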