1. Introduction
Building a theory that can help to understand neural networks and guide their construction is one of the current challenges of machine learning. Here we wish to shed some light on the role symmetry plays in the construction of neural networks. It is well known that symmetry can be used to enhance the performance of neural networks. For example, convolutional neural networks (CNNs) (see [Lecun et al.(1998)]) use the translational symmetry of images to classify images better than fully connected neural networks. Our focus is on the role of symmetry in the initialization stage. We show that symmetry-based initialization can be the difference between failure and success.
At a high level, the study of neural networks can be partitioned into three different aspects.
Expressiveness: Given an architecture, what are the functions it can approximate well?
Training: Given a network with a “proper” architecture, can the network fit the training data, and do so in a reasonable time?
Generalization: Given that the training seemed successful, will the true error be small as well?
We study these aspects for the first “nontrivial” case of neural networks, namely networks with one hidden layer. We are mostly interested in the initialization phase. If we take a network with the appropriate architecture, we can always initialize it to the desired function. A standard method (that induces a nontrivial learning problem) is using random weights to initialize the network. A different reasonable choice is to require the initialization to be useful for an entire class of functions. We follow the latter option.
Our focus is on the role of symmetry. We consider the following class of symmetric functions
where and . The functions in this class are invariant under arbitrary permutations of the input’s coordinates. The parity function and the majority function are well-known examples of symmetric functions.
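The permutation invariance described above can be checked by brute force on small inputs. The following is a minimal sketch, assuming inputs are 0/1 tuples and outputs are 0/1 (the encoding and all function names are illustrative assumptions, not taken from the paper). It also shows that a symmetric function is fully described by its values on the possible Hamming weights.

```python
from itertools import permutations, product

def parity(x):
    """Return 1 if the number of ones in x is odd, else 0."""
    return sum(x) % 2

def majority(x):
    """Return 1 if more than half of the coordinates are one, else 0."""
    return int(2 * sum(x) > len(x))

def is_symmetric(f, n):
    """Check by brute force that f is invariant under every coordinate permutation."""
    for x in product((0, 1), repeat=n):
        for p in permutations(range(n)):
            if f(tuple(x[i] for i in p)) != f(x):
                return False
    return True

def weight_table(f, n):
    """Describe a symmetric f by its value on one input of each Hamming weight 0..n."""
    return [f((1,) * k + (0,) * (n - k)) for k in range(n + 1)]

if __name__ == "__main__":
    print(is_symmetric(parity, 4), weight_table(parity, 4))
    print(is_symmetric(majority, 4), weight_table(majority, 4))
```

A function such as the projection on the first coordinate fails this check, which is why it is not in the class.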
Expressiveness for this class was explored by [Minsky and Papert(1988)]. They showed that the parity function cannot be represented using a network with limited “connectivity”. In contrast, if we use a fully connected network with one hidden layer and a common activation function (like , , or ) only neurons are needed. We provide such explicit representations for all functions in ; see Lemmas 1 and 2.
We also provide useful information on both the training phase and generalization capabilities of the neural network. We show that, with proper initialization, the training process (using standard SGD) efficiently converges to zero empirical error, and that consequently the network has small true error as well.
Theorem 1.
There exists a constant so that the following holds. There exists a network with one hidden layer, neurons with or activations, and an initialization such that for all distributions over and all functions with sample size , after performing SGD updates with a fixed step size it holds that
where and is the network after training over .
The number of parameters in the network described in Theorem 1 is . So in general one could expect overfitting when the sample size is as small as . Nevertheless, the theorem provides generalization guarantees, even for such a small sample size.
The initialization phase plays an important role in proving Theorem 1. To emphasize this, we report an empirical phenomenon (this is “folklore”). We show that a network cannot learn parity from a random initialization (see Section 4.3). On one hand, if the network size is big, we can bring the empirical error to zero (as suggested in [Soudry and Carmon(2016)]), but the true error is close to . On the other hand, if its size is too small, the network is not even able to achieve small empirical error (see Figure 5). We observe a similar phenomenon also for a random symmetric function. An open question remains: why is it true that a sample of size polynomial in does not suffice to learn parity (with random initialization)?
A similar phenomenon was theoretically explained by [Shamir(2016)] and [Song et al.(2017)]. The parity function belongs to the class of all parities
where is the standard inner product. This class is efficiently PAC-learnable with samples using Gaussian elimination. A continuous version of was studied by [Shamir(2016)] and [Song et al.(2017)]. To study the training phase, they used a generalized notion of statistical queries (SQ); see [Kearns(1998)]. In this framework, they show that most functions in the class cannot be efficiently learned (roughly stated, learning the class requires an exponential amount of resources). This framework, however, does not seem to capture actual training of neural networks using SGD. For example, it is not clear if one SGD update corresponds to a single query in this model. In addition, typically one receives a dataset and performs the training by going over it many times, whereas the query model estimates the gradient using a fresh batch of samples in each iteration. The query model also assumes the noise to be adversarial, an assumption that does not necessarily hold in reality. Finally, the SQ-based lower bound holds for every initialization (in particular, for the initialization we use here), so it does not capture the efficient training process Theorem 1 describes.
Theorem 1 shows, however, that with symmetry-based initialization, parity can be efficiently learned. So, in a nutshell, parity cannot be learned as part of , but it can be learned as part of . One could wonder why the hardness proof for cannot be applied for as both classes consist of many input-sensitive functions. The answer lies in the fact that has a far bigger statistical dimension than (all functions in are orthogonal to each other, unlike ).
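The remark that the class of all parities is learnable by Gaussian elimination can be made concrete. The sketch below assumes 0/1 inputs and 0/1 labels of the form y = <a, x> mod 2 for a hidden vector a; the function name `learn_parity` and the bit-packed row representation are illustrative choices, not part of the paper. It returns some vector consistent with all samples, which equals the hidden vector once the samples span the whole space.

```python
def learn_parity(samples, n):
    """Find a in {0,1}^n with <a, x> = y (mod 2) for every sample (x, y).

    Each equation is packed into an int: bit i holds x[i], bit n holds y.
    Returns None if the labels are inconsistent with every parity.
    """
    low_mask = (1 << n) - 1
    pivots = []  # (pivot column, fully reduced row) pairs
    for x, y in samples:
        r = sum(bit << i for i, bit in enumerate(x)) | (y << n)
        # Clear every existing pivot column from the new row.
        for col, p in pivots:
            if (r >> col) & 1:
                r ^= p
        coeffs = r & low_mask
        if coeffs == 0:
            if r >> n:
                return None  # the equation reduced to 0 = 1
            continue  # redundant sample
        col = (coeffs & -coeffs).bit_length() - 1  # lowest set bit
        # Keep earlier rows reduced with respect to the new pivot column.
        pivots = [(c, p ^ r if (p >> col) & 1 else p) for c, p in pivots]
        pivots.append((col, r))
    a = [0] * n
    for col, p in pivots:
        a[col] = (p >> n) & 1  # free variables are set to 0
    return a

if __name__ == "__main__":
    secret = [1, 0, 1]
    data = [((1, 0, 0), 1), ((0, 1, 0), 0), ((0, 0, 1), 1), ((1, 1, 1), 0)]
    print(learn_parity(data, 3) == secret)
```

The running time is polynomial in the number of samples and the dimension, in contrast with the hardness results for gradient-based learners discussed above.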
The proof of the theorem utilizes the different behavior of the two layers in the network. SGD is performed using a step size that is polynomially small in . The analysis shows that in a polynomial number of steps that is independent of the choice of the following two properties hold: (i) the output neuron reaches a “good” state and (ii) the hidden layer does not change in a “meaningful” way. These two properties hold when is small enough. In Section 4.2, we experiment with large values of . We see that, although the training error is zero, the true error becomes large.
Here is a high-level description of the proof. The neurons in the hidden layer define an “embedding” of the input space into (a.k.a. the feature map). This embedding changes in time according to the training examples and process. The proof shows that if at any point in time this embedding has good enough margin, then training with standard SGD quickly converges. This is explained in more detail in Section 3. It remains an interesting open problem to understand this phenomenon in greater generality, using a cleaner and more abstract language.
1.1. Background
To better understand the context of our research, we survey previous related works.
The expressiveness and limitations of neural networks were studied in several works such as [Rahimi and Recht(2008), Telgarsky(2016), Eldan and Shamir(2016)] and [Arora et al.(2016)]. Constructions of small networks for the parity function appeared in several previous works, such as [Wilamowski et al.(2003)], [Arslanov et al.(2016)], [Arslanov et al.(2002)] and [Masato Iyoda et al.(2003)]. Constant depth circuits for the parity function were also studied in the context of computational complexity theory, see for example [Furst et al.(1981)], [Ajtai(1983)] and [Håstad(1987)].
The training phase of neural networks was also studied in many works. Here we list several works that seem most related to ours. [Daniely(2017)] analyzed SGD for general neural network architectures and showed that the training error can be nullified, e.g., for the class of bounded degree polynomials (see also [Andoni et al.(2014)]). [Jacot et al.(2018)] studied neural tangent kernels (NTK), an infinite width analogue of neural networks. [Du et al.(2018)] showed that randomly initialized shallow networks nullify the training error, as long as the number of samples is smaller than the number of neurons in the hidden layer. Their analysis only deals with optimization over the first layer (so that the weights of the output neuron are fixed). [Chizat and Bach(2018)] provided another analysis of the latter two works. [AllenZhu et al.(2018b)] showed that overparametrized neural networks can achieve zero training error, as long as the data points are not too close to one another and the weights of the output neuron are fixed. [Zou et al.(2018)] provided guarantees for zero training error, assuming the two classes are separated by a positive margin.
Convergence and generalization guarantees for neural networks were studied in the following works. [Brutzkus et al.(2017)] studied linearly separable data. [Li and Liang(2018)] studied well-separated distributions. [AllenZhu et al.(2018a)] gave generalization guarantees in expectation for SGD. [Arora et al.(2019)] gave data-dependent generalization bounds for GD. All these works optimized only over the hidden layer (the output layer is fixed after initialization).
Margins play an important role in learning, and we also use them in our proof. [Sokolic et al.(2016)], [Sokolic et al.(2017)], [Bartlett et al.(2017)] and [Sun et al.(2015)] gave generalization bounds for neural networks that are based on their margin when the training ends. From a practical perspective, [Elsayed et al.(2018)], [Romero and Alquezar(2002)] and [Liu et al.(2016)] suggested different training algorithms that optimize the margin.
As discussed above, it seems difficult for neural networks to learn parities. [Song et al.(2017)] and [Shamir(2016)] demonstrated this using the language of statistical queries (SQ). This is a valuable language, but it misses some central aspects of training neural networks. SQ seems to be closely related to GD, but does not seem to capture SGD. SQ also shows that many of the parity functions are difficult to learn, but it does not imply that the parity function is difficult to learn. [Abbe and Sandon(2018)] demonstrated a similar phenomenon in a setting that is closer to the “real life” mechanics of neural networks.
We suggest that taking the symmetries of the learning problem into account can make the difference between failure and success. Several works suggested different neural architectures that take symmetries into account; see [Zaheer et al.(2017)], [Gens and Domingos(2014)], and [Cohen and Welling(2016)].
2. Representations
Here we describe efficient representations for symmetric functions by networks with one hidden layer. These representations are also useful later on, when we study the training process. We study two different activation functions, sigmoid and ReLU (a similar statement can be proved for other activations, like ). Each activation function requires its own representation, as in the two lemmas below.
2.1. Sigmoid
We start with the sigmoid activation, since it helps to understand the construction for the ReLU activation. The building blocks of the symmetric functions are indicators of for . An indicator function is essentially a sum of two sigmoid functions:
where .
Lemma 1.
The symmetric function satisfies .
A network with one hidden layer of neurons with activations is sufficient to represent any symmetric function.
Proof.
For all and of weight ,
the first inequality holds since for all and . For all and of weight ,
the first equality follows from the definition, the first inequality neglects the negative sums, and the second inequality follows because for all .
∎
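To illustrate the idea behind Lemma 1, the sketch below builds an approximate indicator of a given Hamming weight from two sigmoids and sums such indicators over the accepted weights. The offsets of 1/2, the sharpness constant `c`, the threshold 1/2, and all names are illustrative assumptions; the exact constants of the lemma are not reproduced here.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_indicator(s, i, c=10.0):
    """Approximate 1[s == i] for integer s by a sum of two sigmoids minus 1.

    The first sigmoid switches on just below i, the second switches off just
    above i; the offsets and the sharpness c are assumptions of this sketch.
    """
    return sigmoid(c * (s - i + 0.5)) + sigmoid(-c * (s - i - 0.5)) - 1.0

def symmetric_net(x, accepted_weights, c=10.0):
    """Return +1 if the Hamming weight of x is in accepted_weights, else -1."""
    s = sum(x)
    total = sum(soft_indicator(s, i, c) for i in accepted_weights)
    return 1 if total > 0.5 else -1

if __name__ == "__main__":
    odd_weights = {1, 3, 5, 7}  # parity on 8 bits
    print([symmetric_net((1,) * k + (0,) * (8 - k), odd_weights) for k in range(9)])
```

Because each sigmoid is bounded, the contributions of far-away indicators stay small, so the sum can be thresholded safely. Each accepted weight uses two hidden neurons in this sketch.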
2.2. ReLU
An indicator function can be represented using ReLU as , where
A natural idea is to take a linear combination (similarly to the sigmoid case) to get general functions in . However, this fails because the ReLU function is unbounded. The following lemma states the needed correction.
Lemma 2.
Let for . Define . The symmetric function
can be represented as
The lemma shows that a network with one hidden layer of neurons is sufficient to represent any function in . The coefficients of the gates are in this representation.
Proof.
The proof proceeds in two parts. The first part shows the function is constant for all so that . The second part shows that this function equals for all so that and that it is negative for all that satisfy .
For the first part, denote by the value of the symmetric function on inputs of weight . By induction, assume that for some . Think of as a univariate function of the real variable . This function is differentiable for all :
the first equality follows from the definition of , the second equality follows from the definition of the function, and the last equality holds since the first and third sums cancel each other, as do the second and fourth. In a similar manner, for all , we have . So, integrating over concludes the induction .
For the second part, we start by proving that . Let . By definition, . For , we have
Induction on can be used to prove that . Now, by the derivatives calculated in the first part, for it holds that .
∎
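The remark before Lemma 2, that a plain linear combination of ReLU-based indicators fails because ReLU is unbounded, can be seen on a toy example. In the sketch below, `sign_indicator` is a hypothetical two-gate function whose sign detects a single Hamming weight; summing such functions lets far-away terms dominate. The `triangle` function is a standard bounded alternative that uses three gates per weight. It is shown only for contrast and is not the correction stated in Lemma 2.

```python
def relu(t):
    """Rectified linear unit."""
    return max(0.0, t)

def sign_indicator(s, i):
    """Equals 1/2 - |s - i|: positive iff integer s equals i, unbounded below."""
    return 0.5 - relu(s - i) - relu(i - s)

def naive_sum(s, accepted_weights):
    """Naive linear combination of sign indicators (fails in general)."""
    return sum(sign_indicator(s, i) for i in accepted_weights)

def triangle(s, i):
    """Bounded bump: 1 at s == i and 0 at every other integer s."""
    return relu(s - i + 1) - 2.0 * relu(s - i) + relu(s - i - 1)

def bounded_sum(s, accepted_weights):
    """Positive iff integer s is in accepted_weights."""
    return sum(triangle(s, i) for i in accepted_weights) - 0.5

if __name__ == "__main__":
    accepted = {0, 5}
    print(naive_sum(0, accepted))    # negative although weight 0 is accepted
    print(bounded_sum(0, accepted))  # positive
```

In the naive sum, the indicator of weight 5 contributes a large negative value at weight 0 and overrides the correct positive term.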
3. Training and Generalization
The goal of this section is to describe a small network with one hidden layer that (when initialized properly) efficiently learns symmetric functions using a small number of examples (the training is done via SGD).
3.1. Specifications
Here we specify the architecture, initialization and loss function that are implicit in our main result (Theorem 1).
To guarantee convergence of SGD, we need to start with “good” initial conditions. The initialization we pick depends on the activation function it uses, and is chosen with resemblance to Lemma 2 for . On a high level, this indicates that understanding the class of functions we wish to study in terms of “representation” can be helpful when choosing the architecture of a neural network in a learning context.
The network we consider has one hidden layer. We denote by the weight between coordinate of the input and neuron in the hidden layer. We denote by this matrix of weights. We denote by the bias of neuron of the hidden layer. We denote by this vector of weights. We denote by the weight from neuron in the hidden layer to the output neuron. We denote by this vector of weights. We denote by the bias of the output neuron.
Initialize the network as follows: The dimensions of are . For all and , we set
We set and .
To run SGD, we need to choose a loss function. We use the hinge loss,
where is the output of the hidden layer on input and is a parameter of confidence.
3.2. Margins
A key property in the analysis is the ‘margin’ of the hidden layer with respect to the function being learned.
A map over a finite set is linearly separable if there exists such that for all . When the Euclidean norm of is , the number is the margin of with respect to . The number is the margin of .
Footnote 1: A standard “lifting” that adds a coordinate with to every vector allows one to translate the affine case to the linear case.
We are interested in the following set in . Recall that is the weight matrix between the input layer and the hidden layer, and that is the relevant bias vector. Given , we are interested in the set , where . In words, we think of the neurons in the hidden layer as defining an “embedding” of in Euclidean space. A similar construction works for other activation functions. We say that agrees with if for all it holds that .
The following lemma bounds from below the margin of the initial .
Lemma 3.
If is a partition that agrees with some function in for the initialization described above then .
3.3. Freezing the Hidden Layer
Before analyzing the full behavior of SGD, we make an observation: if the weights of the hidden layer are fixed with the initialization described above, then Theorem 1 holds for SGD with batch size . This observation, unfortunately, does not suffice to prove Theorem 1. In the setting we consider, the training of the neural network uses SGD without fixing any weights. This more general case is handled in the next section. The rest of this subsection is devoted to explaining this observation.
[Novikoff(1962)] showed that the perceptron algorithm [Rosenblatt(1958)] makes a small number of mistakes for linearly separable data with large margin. For a comprehensive survey of the perceptron algorithm and its variants, see [Moran et al.(2018)].
Running SGD with the hinge loss induces the same update rule as in a modified perceptron algorithm, Algorithm 1.
Novikoff’s proof can be generalized to any and batches of any size to yield the following theorem; see [Collobert and Bengio(2004), Krauth and Mezard(1987)] and Appendix A.
Theorem 2.
For with margin and step size , the modified perceptron algorithm performs at most updates and achieves a margin of at least , where .
So, when the weights of the hidden layer are fixed, Lemma 3 implies that the number of SGD steps is at most polynomial in .
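The modified perceptron referred to above can be sketched as follows, assuming the update rule that SGD with the hinge loss induces on a linear classifier: whenever an example is not classified correctly with confidence above a threshold, the weight vector moves by the step size times the signed example. The names `beta` (confidence threshold), `h` (step size), and `max_updates` (a safety cap) are assumptions of this sketch, and batches have size one.

```python
def dot(u, v):
    """Standard inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def modified_perceptron(points, labels, beta, h, max_updates=100000):
    """Cycle over the data until every example has margin above beta.

    points: vectors, already lifted with a constant coordinate for the bias
    labels: +1 or -1 for each point
    Returns the weight vector and the number of updates performed.
    """
    w = [0.0] * len(points[0])
    updates = 0
    changed = True
    while changed and updates < max_updates:
        changed = False
        for x, y in zip(points, labels):
            if y * dot(w, x) <= beta:  # wrong, or right with low confidence
                w = [wi + h * y * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
    return w, updates

if __name__ == "__main__":
    pts = [(1.0, 1.0), (2.0, 1.0), (-1.0, 1.0), (-2.0, 1.0)]
    lbl = [1, 1, -1, -1]
    print(modified_perceptron(pts, lbl, beta=1.0, h=0.1))
```

With `beta` set to zero and `h` set to one this reduces to the classical perceptron; a positive `beta` forces the final separator to have a margin, which is the property used in the generalization argument.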
3.4. Stability
When we run SGD on the entire network, the layers interact. For a network at time , the update rule for is as follows. If the network classifies the input correctly with confidence more than , no change is made. Otherwise, we change the weights in by , where is the true label and is the step size. If, in addition, neuron of the hidden layer fired on , we update its incoming weights by . These update rules define the following dynamical system:
(1)  
(2)  
(3)  
(4) 
where is the Heaviside step function and is the Hadamard pointwise product.
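The update rules described in words above can be sketched as a single SGD step on a one-hidden-layer ReLU network with the hinge loss. The symbol names (`W`, `b` for the hidden layer, `m`, `c` for the output neuron, `h` for the step size, `beta` for the confidence parameter) and the convention that a neuron "fires" only when its pre-activation is strictly positive are assumptions of this sketch.

```python
def relu_layer(W, b, x):
    """Return the pre-activations and the ReLU outputs of the hidden layer."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    return pre, [max(0.0, p) for p in pre]

def sgd_step(W, b, m, c, x, y, h, beta):
    """One hinge-loss SGD step on example (x, y) with y in {+1, -1}.

    Returns the new parameters and a flag telling whether an update happened.
    """
    pre, z = relu_layer(W, b, x)
    out = sum(mj * zj for mj, zj in zip(m, z)) + c
    if y * out > beta:
        return W, b, m, c, False  # confident and correct: no change
    fired = [1.0 if p > 0 else 0.0 for p in pre]  # Heaviside of pre-activation
    # Hidden layer: only neurons that fired move, scaled by their output weight
    # as it was before this step.
    new_W = [[w + h * y * mj * fj * xi for w, xi in zip(row, x)]
             for row, mj, fj in zip(W, m, fired)]
    new_b = [bj + h * y * mj * fj for bj, mj, fj in zip(b, m, fired)]
    # Output neuron: a perceptron-style update on the embedded point z.
    new_m = [mj + h * y * zj for mj, zj in zip(m, z)]
    new_c = c + h * y
    return new_W, new_b, new_m, new_c, True

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, -1.0]]
    print(sgd_step(W, [0.0, 0.0], [1.0, 1.0], 0.0, [1.0, 1.0], -1, 0.1, 1.0))
```

The output-neuron part of the step is the modified perceptron update applied to the current embedding, while the hidden-layer part is what moves the embedding itself.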
A key observation in the proof is that the weights of the last layer ((3) and (4)) are updated exactly as in the modified perceptron algorithm. Another key statement in the proof is that if the network has reached a good representation of the input (i.e., the hidden layer has a large margin), then the interaction between the layers during the continued training does not impair this representation. This is summarized in the following lemma (we are not aware of a similar statement in the literature).
Lemma 4.
Let , , and be a linearly separable embedding of and with margin by the hidden layer of a neural network of depth two with activation and weights given by . Let , let , and be the integration step. Assuming and , and using in the loss function, after SGD iterations the following hold:

Each moves a distance of at most .

The norm is at most .

The training ends in at most SGD updates.
Intuitively, this type of lemma can be useful in many other contexts. The high level idea is to identify a “good geometric structure” that the network reaches and enables efficient learning.
Proof.
We are interested in the maximal distance the embedding of an element has moved from its initial embedding:
(5)  
(6)  
(7) 
To simplify equations (1)-(4) discussed above, we assume that during the optimization process the norms of the weights and grow at a maximal rate:
(8)  
(9) 
here the norm of a matrix is the norm.
To bound these quantities, we follow the modified perceptron proof and add another quantity to bound. That is, the maximal norm of the embedded space at time satisfies (by assumption )
we used that the spectral norm of a matrix is at most its norm.
We assume a worst case where grows monotonically at a maximal rate. By the modified perceptron algorithm and choice ,
By choice of and assuming ,
Solving the above recursive equation, it holds for all ,
Now, summing equation (7), we have
since .
So in updates, the elements embedded by the network travelled at most . Hence, the samples the network received kept a margin of during training (by the assumption ). By choice of the loss function, SGD changes the output neuron as in the modified perceptron algorithm. By Theorem 2, the number of updates is at most . So, the assumption on we made during the proof holds.
∎
3.5. Main Result
Proof of Theorem 1.
There is an unknown distribution over the space . We pick i.i.d. examples where according to , where for some . Run SGD for steps, where the step size is and the parameter of the loss function is with .
We claim that it suffices to show that at the end of the training (i) the network correctly classifies all the sample points , and (ii) for every such that there exists with , the network outputs on as well. Here is why. The initialization of the network embeds the space into dimensional space (including the bias neuron of the hidden layer). Let be the initial embedding . Although , the size of is . The VC dimension of all the boolean functions over is . Now, samples suffice to yield true error for an ERM when the VC dimension is ; see e.g. Theorem 6.7 in [ShalevShwartz and BenDavid(2014)]. It remains to prove (i) and (ii) above.
By Lemma 3, at the beginning of the training, the partition of defined by the target has a margin of . We are interested in the eventual embedding of as well. The modified perceptron algorithm guarantees that after updates, () separates the embedded sample with a margin of at least . This happens as long as the updates we perform come from a set with maximal norm and with margin at least . This is guaranteed by Lemma 4 and concludes the proof of (i).
It remains to prove (ii). Lemma 4 states that as long as fewer than updates were made, the elements in moved at most . At the end of the training, the embedded sample is separated with a margin of at least
with respect to the hyperplane defined by and . Each for moved at most . This means that if then the network has the same output on and . Since the network has zero empirical error, the output on this is as well.
A similar proof is available with activation (with better convergence rate and larger allowed step size).
∎
Remark.
The generalization part of the above proof can be viewed as a consequence of sample compression ([Littlestone and Warmuth(1986)]). Although the eventual network depends on all examples, the proof shows that its functionality depends on at most examples. Indeed, after the training, all examples with equal Hamming weight have the same label.
4. Experiments
We accompany the theoretical results with some experiments. We used a network with one hidden layer of neurons, activation, and the hinge loss with . In all the experiments, we used SGD with mini-batches of size one, and before each epoch we randomized the sample. The graphs present the training error and the true error versus the epoch of the training process. In all the comparisons below, we chose a random symmetric function and a random sample from .
Footnote 2: We deal with high dimensional spaces, so the true error was not calculated exactly but approximated on an independent batch of samples of size .
4.1. The Theory in Practice
Figure 2 demonstrates our theoretical results and also validates the performance of our initialization. In one setting, we trained only the second layer (the weights of the hidden layer were frozen), which essentially corresponds to the perceptron algorithm. In the second setting, we trained both layers with a step size (as the theory suggests). As expected, performance in both cases is similar. We remark that SGD continues to run even after minimizing the empirical error. This happens because of the parameter .
4.2. Overstepping the Theory
Here we experiment with two parameters in the proof, the step size and the confidence parameter .
In Figure 3, we used three different step sizes, two of which are much larger than the theory suggests. We see that the training error converges to zero much faster when the step size is larger. This fast convergence comes at the expense of the true error. For a large step size, generalization ceases to hold.
Setting is a construct in the proof. Figure 4 shows that setting does not impair the performance. The difference between theory (requires ) and practice (allows ) can be explained as follows. The proof bounds the worst-case movement of the hidden layer, whereas in practice an average-case argument suffices.
4.3. Hard to Learn Parity
Figure 5 shows that even for , learning parity is hard from a random initialization. When the sample size is small the training error can be nullified but the true error is large. As the sample grows, it becomes much harder for the network to nullify even the training error. With our initialization, both the training error and true error are minimized quickly. Figure 6 demonstrates the same phenomenon for a random symmetric function.
4.4. Corruption of Data
Our initialization also delivers satisfying results when the input data is corrupted. In Figure 7, we randomly perturb (with probability ) the labels and use the same SGD to train the model. In Figure 8, we randomly shift every entry of the vectors in the space by that is uniformly distributed in .
5. Conclusion
This work demonstrates that symmetries can play a critical role when designing a neural network. We proved that any symmetric function can be learned by a shallow neural network, with proper initialization. We demonstrated by simulations that this neural network is stable under corruption of data, and that the small step size in the proof is necessary.
We also demonstrated that the parity function or a random symmetric function cannot be learned with random initialization. How to explain this empirical phenomenon is still an open question. The works [Shamir(2016)] and [Song et al.(2017)] treated parities using the language of SQ. This language obscures the inner mechanism of the network training, so a more concrete explanation is currently missing.
We proved in a special case that the standard SGD training of a network efficiently produces low true error. The general problem that remains is proving similar results for general neural networks. A suggestion for future works is to try to identify favorable geometric states of the network that guarantee fast convergence and generalization.
Acknowledgements
We wish to thank Adam Klivans for helpful comments.
References

 [Abbe and Sandon(2018)] Emmanuel Abbe and Colin Sandon. Provable limitations of deep learning, 2018.
 [Ajtai(1983)] M. Ajtai. $\Sigma_1^1$-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), pages 1–48, 1983.
 [AllenZhu et al.(2018a)] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR, abs/1811.04918, 2018a.
 [AllenZhu et al.(2018b)] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. CoRR, abs/1811.03962, 2018b.
 [Andoni et al.(2014)] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1908–1916, 2014.
 [Arora et al.(2016)] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. CoRR, abs/1611.01491, 2016.
 [Arora et al.(2019)] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. CoRR, abs/1901.08584, 2019.
 [Arslanov et al.(2016)] Marat Arslanov, Zhazira E. Amirgalieva, and Chingiz A. Kenshimov. N-bit parity neural networks with minimum number of threshold neurons. Open Engineering, 6, 01 2016.
 [Bartlett et al.(2017)] Peter Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks, 2017.
 [Brutzkus et al.(2017)] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. In ICLR, 2018.
 [Chizat and Bach(2018)] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming, 12, 2018.
 [Cohen and Welling(2016)] Taco S. Cohen and Max Welling. Group equivariant convolutional networks, 2016.
 [Collobert and Bengio(2004)] Ronan Collobert and Samy Bengio. Links between perceptrons, MLPs and SVMs. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, page 23, 2004.
 [Daniely(2017)] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems 30, pages 2422–2430, 2017.
 [Du et al.(2018)] Simon S. Du, Xiyu Zhai, Barnabás Póczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. CoRR, abs/1810.02054, 2018.
 [Eldan and Shamir(2016)] Ronen Eldan and Ohad Shamir. The Power of Depth for Feedforward Neural Networks. In JMLR 49, pages 1–34, 2016.
 [Elsayed et al.(2018)] Gamaleldin F. Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. In NIPS, pages 850–860, 2018.
 [Furst et al.(1981)] Merrick Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy. In FOCS, pages 260–270, 1981.
 [Gens and Domingos(2014)] Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in Neural Information Processing Systems 27, pages 2537–2545, 2014.
 [Håstad(1987)] Johan Håstad. Computational Limitations of Small-depth Circuits. MIT Press, Cambridge, MA, USA, 1987.
 [Jacot et al.(2018)] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In NIPS, pages 8580–8589, 2018.
 [Kearns(1998)] Michael Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45 (6), pages 983–1006, 1998.
 [Krauth and Mezard(1987)] Werner Krauth and Marc Mezard. Learning algorithms with optimal stability in neural networks. J. Phys., A20, pages L745–L752, 1987.
 [Lecun et al.(1998)] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86 (11), pages 2278–2324, 1998.

 [Li and Liang(2018)] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data, 2018.
 [Littlestone and Warmuth(1986)] Nick Littlestone and Manfred K. Warmuth. Relating data compression and learnability. Technical report, 1986.
 [Liu et al.(2016)] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
 [Masato Iyoda et al.(2003)] Eduardo Masato Iyoda, Hajime Nobuhara, and Kaoru Hirota. A solution for the nbit parity problem using a single translated multiplicative neuron. Neural Processing Letters, 18:233–238, 12 2003.
 [Minsky and Papert(1988)] Marvin L. Minsky and Seymour A. Papert. Perceptrons: Expanded Edition. MIT Press, Cambridge, MA, USA, 1988.
 [Moran et al.(2018)] Shay Moran, Ido Nachum, Itai Panasoff, and Amir Yehudayoff. On the perceptron’s compression. CoRR, abs/1806.05403, 2018.
 [Novikoff(1962)] Albert B.J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.
 [Rahimi and Recht(2008)] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184, 2008.
 [Romero and Alquezar(2002)] E. Romero and R. Alquezar. Maximizing the margin with feedforward neural networks. In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290), volume 1, pages 743–748, 2002.
 [Rosenblatt(1958)] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, pages 65–386, 1958.
 [ShalevShwartz and BenDavid(2014)] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
 [Shamir(2016)] Ohad Shamir. Distribution-specific hardness of learning neural networks. CoRR, abs/1609.01037, 2016.
 [Sokolic et al.(2016)] Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Margin preservation of deep neural networks. CoRR, abs/1605.08254, 2016.
 [Sokolic et al.(2017)] Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65, pages 4265–4280, 2017.
 [Song et al.(2017)] Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. CoRR, abs/1707.04615, 2017.
 [Soudry and Carmon(2016)] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. 2016.
 [Sun et al.(2015)] Shizhao Sun, Wei Chen, Liwei Wang, and Tie-Yan Liu. Large margin deep neural networks: Theory and algorithms. CoRR, abs/1506.05232, 2015.
 [Telgarsky(2016)] Matus Telgarsky. Representation Benefits of Deep Feedforward Networks. In JMLR, 49, pages 1 – 23, 2016.
 [Wilamowski et al.(2003)] Bogdan Wilamowski, David Hunter, and Aleksander Malinowski. Solving parity-N problems with feedforward neural networks. In IJCNN, pages 2546–2551, 08 2003.
 [Arslanov et al.(2002)] M Z. Arslanov, D U. Ashigaliev, and Esraa Ismail. N-bit parity ordered neural networks. Neurocomputing, 48:1053–1056, 10 2002.
 [Zaheer et al.(2017)] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets, 2017.
 [Zou et al.(2018)] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. CoRR, abs/1811.08888, 2018.
Appendix A The Modified Perceptron
Proof of Theorem 2.
Denote by the optimal separating hyperplane with . It satisfies for all . By the definition,
and
By the Cauchy-Schwarz inequality, . So the number of updates is bounded by
At time the margin of any that does not require an update is at least
The right-hand side is a monotonically decreasing function of , so by plugging in the maximal number of updates we see that the minimal margin of the output is at least
∎