On Symmetry and Initialization for Neural Networks

Ido Nachum et al. (Technion), July 1, 2019

This work provides an additional step in the theoretical understanding of neural networks. We consider neural networks with one hidden layer and show that when learning symmetric functions, one can choose initial conditions so that standard SGD training efficiently produces generalization guarantees. We empirically verify this and show that this does not hold when the initial conditions are chosen at random. The proof of convergence investigates the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.


1. Introduction

Building a theory that can help to understand neural networks and guide their construction is one of the current challenges of machine learning. Here we wish to shed some light on the role symmetry plays in the construction of neural networks. It is well-known that symmetry can be used to enhance the performance of neural networks. For example, convolutional neural networks (CNNs) (see [Lecun et al.(1998)]) use the translational symmetry of images to classify images better than fully connected neural networks. Our focus is on the role of symmetry in the initialization stage. We show that symmetry-based initialization can be the difference between failure and success.

On a high level, the study of neural networks can be partitioned into three different aspects.

Expressiveness:

Given an architecture, what are the functions it can approximate well?

Training:

Given a network with a “proper” architecture, can the network fit the training data, and in a reasonable time?

Generalization:

Given that the training seemed successful, will the true error be small as well?

We study these aspects for the first “non-trivial” case of neural networks, networks with one hidden layer. We are mostly interested in the initialization phase. If we take a network with the appropriate architecture, we can always initialize it to the desired function. A standard method (that induces a non-trivial learning problem) is using random weights to initialize the network. A different reasonable choice is to require the initialization to be useful for an entire class of functions. We follow the latter option.

Our focus is on the role of symmetry. We consider the following class of symmetric functions over the Boolean cube:

S_n = { f : {0,1}^n → {−1, 1} : f(x_1, …, x_n) = f(x_{π(1)}, …, x_{π(n)}) for every permutation π of {1, …, n} }.

The functions in this class are invariant under arbitrary permutations of the input’s coordinates; equivalently, their value depends only on the Hamming weight of the input. The parity function and the majority function are well-known examples of symmetric functions.
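For concreteness, here is a small illustration of ours (not taken from the paper): a symmetric function depends only on the Hamming weight of its input, and parity and majority are two examples.

import numpy as np

def parity(x):
    # +1 on even Hamming weight, -1 on odd Hamming weight.
    return 1 - 2 * (int(np.sum(x)) % 2)

def majority(x):
    # +1 if at least half the coordinates equal 1, -1 otherwise.
    return 1 if 2 * int(np.sum(x)) >= len(x) else -1

# Symmetric: permuting the coordinates does not change the value.
x = np.array([1, 0, 1, 1, 0])
perm = np.random.permutation(len(x))
assert parity(x) == parity(x[perm])
assert majority(x) == majority(x[perm])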

Expressiveness for this class was explored by [Minsky and Papert(1988)]. They showed that the parity function cannot be represented using a network with limited “connectivity”. In contrast, if we use a fully connected network with one hidden layer and a common activation function (such as the sigmoid or the ReLU), only O(n) neurons are needed. We provide such explicit representations for all functions in S_n; see Lemmas 1 and 2.

We also provide useful information on both the training phase and generalization capabilities of the neural network. We show that, with proper initialization, the training process (using standard SGD) efficiently converges to zero empirical error, and that consequently the network has small true error as well.

Theorem 1.

There exists a constant c > 0 so that the following holds. There exists a network with one hidden layer of O(n) neurons with ReLU or sigmoid activations, and an initialization, such that for all distributions D over {0,1}^n and all functions f in S_n, given a sample S of size polynomial in n and 1/ε, after performing polynomially many SGD updates with a fixed step size that is polynomially small in n it holds that

err_D(N_S) ≤ ε,

where err_D(N) = Pr_{x ∼ D}[sign(N(x)) ≠ f(x)] and N_S is the network after training over S.

The number of parameters in the network described in Theorem 1 is of order n². So in general one could expect overfitting with a sample as small as the one the theorem allows. Nevertheless, the theorem provides generalization guarantees, even for such a small sample size.

The initialization phase plays an important role in proving Theorem 1. To emphasize this, we report an empirical phenomenon (this is “folklore”). We show that a network cannot learn parity from a random initialization (see Section 4.3). On one hand, if the network size is big, we can bring the empirical error to zero (as suggested in [Soudry and Carmon(2016)]), but the true error stays close to 1/2. On the other hand, if its size is too small, the network is not even able to achieve small empirical error (see Figure 5). We observe a similar phenomenon also for a random symmetric function. An open question remains: why is it true that a sample of size polynomial in n does not suffice to learn parity (with random initialization)?

A similar phenomenon was theoretically explained by [Shamir(2016)] and [Song et al.(2017)]. The parity function belongs to the class of all parities

P_n = { x ↦ (−1)^{⟨a, x⟩} : a ∈ {0,1}^n },

where ⟨·, ·⟩ is the standard inner product. This class is efficiently PAC-learnable using Gaussian elimination. A continuous version of P_n was studied by [Shamir(2016)] and [Song et al.(2017)]. To study the training phase, they used a generalized notion of statistical queries (SQ); see [Kearns(1998)]. In this framework, they show that most functions in the class cannot be efficiently learned (roughly stated, learning the class requires an exponential amount of resources). This framework, however, does not seem to capture actual training of neural networks using SGD. For example, it is not clear if one SGD update corresponds to a single query in this model. In addition, typically one receives a dataset and performs the training by going over it many times, whereas the query model estimates the gradient using a fresh batch of samples in each iteration. The query model also assumes the noise to be adversarial, an assumption that does not necessarily hold in reality. Finally, the SQ-based lower bound holds for every initialization (in particular, for the initialization we use here), so it does not capture the efficient training process Theorem 1 describes.

Theorem 1 shows, however, that with symmetry-based initialization, parity can be efficiently learned. So, in a nutshell, parity cannot be learned as part of P_n, but it can be learned as part of S_n. One could wonder why the hardness proof for P_n cannot be applied to S_n, as both classes consist of many input-sensitive functions. The answer lies in the fact that P_n has a far bigger statistical dimension than S_n (all the functions in P_n are orthogonal to each other, unlike those in S_n).

The proof of the theorem utilizes the different behavior of the two layers in the network. SGD is performed using a step size that is polynomially small in n. The analysis shows that within a polynomial number of steps, independently of the underlying distribution, the following two properties hold: (i) the output neuron reaches a “good” state and (ii) the hidden layer does not change in a “meaningful” way. These two properties hold when the step size is small enough. In Section 4.2, we experiment with large values of the step size. We see that, although the training error is zero, the true error becomes large.

Here is a high-level description of the proof. The neurons in the hidden layer define an “embedding” of the input space into Euclidean space (a.k.a. the feature map). This embedding changes over time, according to the training examples and process. The proof shows that if at any point in time this embedding has a good enough margin, then training with standard SGD quickly converges. This is explained in more detail in Section 3. It remains an interesting open problem to understand this phenomenon in greater generality, using a cleaner and more abstract language.

1.1. Background

To better understand the context of our research, we survey previous related works.

The expressiveness and limitations of neural networks were studied in several works such as [Rahimi and Recht(2008), Telgarsky(2016), Eldan and Shamir(2016)] and [Arora et al.(2016)]. Constructions of small networks for the parity function appeared in several previous works, such as [Wilamowski et al.(2003)], [Arslanov et al.(2016)], [Arslanov et al.(2002)] and [Masato Iyoda et al.(2003)]. Constant depth circuits for the parity function were also studied in the context of computational complexity theory, see for example [Furst et al.(1981)], [Ajtai(1983)] and [Håstad(1987)].

The training phase of neural networks was also studied in many works. Here we list several works that seem most related to ours. [Daniely(2017)] analyzed SGD for a general neural network architecture and showed that the training error can be nullified, e.g., for the class of bounded-degree polynomials (see also [Andoni et al.(2014)]). [Jacot et al.(2018)] studied neural tangent kernels (NTK), an infinite-width analogue of neural networks. [Du et al.(2018)] showed that randomly initialized shallow networks nullify the training error, as long as the number of samples is smaller than the number of neurons in the hidden layer. Their analysis only deals with optimization over the first layer (so that the weights of the output neuron are fixed). [Chizat and Bach(2018)] provided another analysis of the latter two works. [Allen-Zhu et al.(2018b)] showed that over-parametrized neural networks can achieve zero training error, as long as the data points are not too close to one another and the weights of the output neuron are fixed. [Zou et al.(2018)] provided guarantees for zero training error, assuming the two classes are separated by a positive margin.

Convergence and generalization guarantees for neural networks were studied in the following works. [Brutzkus et al.(2017)] studied linearly separable data. [Li and Liang(2018)] studied well separated distributions. [Allen-Zhu et al.(2018a)] gave generalization guarantees in expectation for SGD. [Arora et al.(2019)] gave data-dependent generalization bounds for GD. All these works optimized only over the hidden layer (the output layer is fixed after initialization).

Margins play an important role in learning, and we also use them in our proof. [Sokolic et al.(2016)], [Sokolic et al.(2017)], [Bartlett et al.(2017)] and [Sun et al.(2015)] gave generalization bounds for neural networks that are based on their margin when the training ends. From a practical perspective, [Elsayed et al.(2018)], [Romero and Alquezar(2002)] and [Liu et al.(2016)] suggested different training algorithms that optimize the margin.

As discussed above, it seems difficult for neural networks to learn parities. [Song et al.(2017)] and [Shamir(2016)] demonstrated this using the language of statistical queries (SQ). This is a valuable language, but it misses some central aspects of training neural networks. SQ seems to be closely related to GD, but does not seem to capture SGD. SQ also shows that many of the parity functions are difficult to learn, but it does not imply that the parity function itself is difficult to learn. [Abbe and Sandon(2018)] demonstrated a similar phenomenon in a setting that is closer to the “real life” mechanics of neural networks.

We suggest that taking the symmetries of the learning problem into account can make the difference between failure and success. Several works suggested different neural architectures that take symmetries into account; see [Zaheer et al.(2017)], [Gens and Domingos(2014)], and [Cohen and Welling(2016)].

2. Representations

Here we describe efficient representations of symmetric functions by networks with one hidden layer. These representations are also useful later on, when we study the training process. We study two different activation functions, the sigmoid and the ReLU (similar statements can be proved for other activations). Each activation function requires its own representation, as in the two lemmas below.

Figure 1. Approximations of a symmetric function by sigmoid and ReLU activations.

2.1. Sigmoid

We start with the sigmoid activation σ(z) = 1/(1 + e^{−z}), since it helps to understand the construction for the ReLU activation. The building blocks of the symmetric functions are the indicators of the Hamming weights, 1_{|x| = k} for k in {0, 1, …, n}. Such an indicator is essentially a sum of two sigmoids: for a large scaling constant C > 0,

1_{|x| = k}(x) ≈ σ(C(|x| − k + 1/2)) + σ(C(k − |x| + 1/2)) − 1,

where |x| = x_1 + … + x_n denotes the Hamming weight of x.

Lemma 1.

A network with one hidden layer of O(n) neurons with sigmoid activations is sufficient to represent (in sign) any symmetric function in S_n.

Proof.

For every x whose weight k satisfies f(x) = 1, the sum is dominated by the weight-k indicator, which is close to 1, while the remaining terms are negligible since σ takes values close to 0 or 1 outside a small interval; hence the output is positive. The same argument, with the signs reversed, shows that the output is negative on every x with f(x) = −1. ∎
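As a quick numerical check of this construction (our code; the value of the scaling constant C below is an arbitrary choice of ours):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weight_indicator(s, k, C=20.0):
    # Approximate 1[s == k] for integer s by a sum of two sigmoids.
    return sigmoid(C * (s - k + 0.5)) + sigmoid(C * (k - s + 0.5)) - 1.0

def symmetric_net(x, g, C=20.0):
    # g[k] in {-1, +1} is the desired value on inputs of Hamming weight k.
    s = np.sum(x)
    n = len(x)
    return sum(g[k] * weight_indicator(s, k, C) for k in range(n + 1))

n = 8
rng = np.random.default_rng(0)
g = rng.choice([-1.0, 1.0], size=n + 1)      # a random symmetric function
x = rng.integers(0, 2, size=n)
print(np.sign(symmetric_net(x, g)), g[int(np.sum(x))])   # the signs agree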

2.2. ReLU

An indicator function can also be represented using the ReLU activation, ReLU(z) = max(z, 0).

A natural idea is to take a linear combination of such representations (similarly to the sigmoid case) to get general functions in S_n. However, this fails because the ReLU function is unbounded. The following lemma states the needed correction.

Lemma 2.

Every symmetric function in S_n can be represented exactly as an affine combination of the ReLU units ReLU(|x| − j), j = 0, 1, …, n − 1.

The lemma shows that a network with one hidden layer of O(n) neurons is sufficient to represent any function in S_n. The coefficients of the ReLU gates are bounded by a constant in this representation.

Proof.

The proof proceeds in two parts. The first part shows the function is constant for all so that . The second part shows that this function equals for all so that and that it is negative for all that satisfy .

For the first part, denote by the value of the symmetric function on inputs of weight . By induction, assume that for some . Think of as a univariate function of the real variable . This function is differentiable for all :

the first equality follows from the definition of , the second equality follows from the definition of the function, and the last equality holds since the first and third sum cancel each other and the second and fourth sum as well. In a similar manner, for all , we have . So, integrating over concludes the induction .

For the second part, we start by proving that . Let . By definition, . For , we have

Induction on can be used to prove that . Now, by the derivatives calculated in the first part, for it holds that .
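As a concrete illustration of such a representation, the following sketch uses a generic second-difference choice of coefficients. This is our construction, given for illustration; the exact coefficients of Lemma 2 are not recoverable from this version of the text and may differ.

import numpy as np
from itertools import product

def relu(z):
    return np.maximum(z, 0.0)

def relu_coefficients(g):
    # g[k] is the desired value on Hamming weight k, for k = 0..n.
    # Write g(k) = g(0) + sum_{j=0}^{n-1} c[j] * relu(k - j); the c[j] are
    # second differences of g, obtained by telescoping.
    n = len(g) - 1
    c = np.zeros(n)
    c[0] = g[1] - g[0]
    for j in range(1, n):
        c[j] = (g[j + 1] - g[j]) - (g[j] - g[j - 1])
    return c

def relu_net(x, g, c):
    s = np.sum(x)
    return g[0] + sum(c[j] * relu(s - j) for j in range(len(c)))

n = 6
rng = np.random.default_rng(1)
g = rng.choice([-1.0, 1.0], size=n + 1)
c = relu_coefficients(g)
# The representation is exact on every point of the cube.
assert all(relu_net(np.array(x), g, c) == g[sum(x)]
           for x in product([0, 1], repeat=n))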

3. Training and Generalization

The goal of this section is to describe a small network with one hidden layer that (when initialized properly) efficiently learns symmetric functions using a small number of examples (the training is done via SGD).

3.1. Specifications

Here we specify the architecture, the initialization, and the loss function that are implicit in our main result (Theorem 1).

To guarantee convergence of SGD, we need to start with “good” initial conditions. The initialization we pick depends on the activation function, and is chosen with resemblance to the representation of Lemma 2 for the ReLU. On a high level, this indicates that understanding the class of functions we wish to study in terms of “representation” can be helpful when choosing the architecture of a neural network in a learning context.

The network we consider has one hidden layer. We denote by W_{ij} the weight between coordinate i of the input and neuron j in the hidden layer, and by W the matrix of these weights. We denote by b_j the bias of neuron j of the hidden layer, and by b the vector of these biases. We denote by u_j the weight from neuron j in the hidden layer to the output neuron, and by u the vector of these weights. We denote by c the bias of the output neuron.

Initialize the network as follows. The hidden layer (the matrix W and the vector b) is set according to the ReLU representation of Lemma 2, so that the hidden neurons compute the units used there. We set u = 0 and c = 0.

To run SGD, we need to choose a loss function. We use the hinge loss

ℓ(x, y) = max{0, γ − y (⟨u, h(x)⟩ + c)},

where h(x) is the output of the hidden layer on input x and γ ≥ 0 is a parameter of confidence.
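To make the specification concrete, here is a minimal sketch in Python. The exact initialization constants are not fully recoverable from this version of the text, so the Lemma-2-style feature map below (all-ones input weights and biases 0, −1, …, −(n−1)) and the zero output layer should be read as assumptions.

import numpy as np

def init_network(n):
    # Hidden layer: neuron j computes relu(<1, x> - j), a Lemma-2-style
    # feature map (our reading of the initialization; the paper's exact
    # constants may differ).
    W = np.ones((n, n))             # input-to-hidden weights
    b = -np.arange(n, dtype=float)  # hidden biases: 0, -1, ..., -(n-1)
    u = np.zeros(n)                 # hidden-to-output weights, start at zero
    c = 0.0                         # output bias
    return W, b, u, c

def hidden(x, W, b):
    return np.maximum(W @ x + b, 0.0)

def output(x, W, b, u, c):
    return u @ hidden(x, W, b) + c

def hinge_loss(x, y, W, b, u, c, gamma):
    # Hinge loss with confidence parameter gamma: no loss once the network
    # classifies (x, y) correctly with confidence at least gamma.
    return max(0.0, gamma - y * output(x, W, b, u, c))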

3.2. Margins

A key property in the analysis is the ‘margin’ of the hidden layer with respect to the function being learned.

A map y : Z → {−1, 1} over a finite set Z ⊂ ℝ^d is linearly¹ separable if there exists w ∈ ℝ^d such that y(z)⟨w, z⟩ > 0 for all z ∈ Z. When the Euclidean norm of w is 1, the number min_{z ∈ Z} y(z)⟨w, z⟩ is the margin of w with respect to (Z, y). The maximum of this quantity over all unit vectors w is the margin of (Z, y). ¹A standard “lifting” that adds a coordinate with value 1 to every vector allows translating the affine case to the linear case.

We are interested in the following set in Euclidean space. Recall that W is the weight matrix between the input layer and the hidden layer, and that b is the relevant bias vector. Given W and b, we are interested in the set H(X) = { h(x) : x ∈ X }, where X = {0,1}^n and h(x) = ReLU(Wx + b) is the image of x under the hidden layer. In words, we think of the neurons in the hidden layer as defining an “embedding” of X in Euclidean space. A similar construction works for other activation functions. We say that a partition of H(X) into a positive part and a negative part agrees with a function f ∈ S_n if, for every x ∈ X, the point h(x) lies in the positive part exactly when f(x) = 1. (Such a partition corresponds to a map y as above, so its margin is well defined.)
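The following small helper (our code, not the paper's) computes the margin of a labelled embedding with respect to a candidate separator, matching the definition above.

import numpy as np

def margin(Z, y, w):
    # Z: embedded points (rows), y: labels in {-1, +1}, w: separator.
    # The margin is min_i y_i <w, z_i> after normalizing w to unit norm.
    w = w / np.linalg.norm(w)
    return float(np.min(y * (Z @ w)))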

The following lemma bounds from below the margin of the initial embedding.

Lemma 3.

If P is a partition that agrees with some function in S_n, then, for the initialization described above, the margin of P is at least inverse polynomial in n.

Proof.

By Lemmas 1 and 2, any function in S_n can be represented using a vector u* of weights of the output neuron together with a bias c*. These induce a partition of H(X); namely, h(x) lies in the positive part exactly when ⟨u*, h(x)⟩ + c* > 0. Since the norm of (u*, c*) is at most polynomial in n, we have our desired result. ∎

3.3. Freezing the Hidden Layer

Before analyzing the full behavior of SGD, we make an observation: if the weights of the hidden layer are fixed at the initialization described above, then Theorem 1 holds for SGD with mini-batches of any size. This observation, unfortunately, does not suffice to prove Theorem 1. In the setting we consider, the training of the neural network uses SGD without fixing any weights. This more general case is handled in the next section. The rest of this subsection is devoted to explaining this observation.

[Novikoff(1962)] showed that the perceptron algorithm [Rosenblatt(1958)] makes a small number of mistakes on linearly separable data with large margin. For a comprehensive survey of the perceptron algorithm and its variants, see [Moran et al.(2018)].

Running SGD with the hinge loss induces the same update rule as in a modified perceptron algorithm, Algorithm 1.

  Initialize: w ← 0, t ← 0
  while there exists (x, y) ∈ S with y⟨w, x⟩ ≤ γ do
     w ← w + η y x
     t ← t + 1
  end while
  return w
Algorithm 1 The modified perceptron algorithm
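For concreteness, here is a Python rendering of Algorithm 1 as reconstructed above (a sketch; the parameter names are ours).

import numpy as np

def modified_perceptron(Z, y, gamma, eta, max_updates=100000):
    # Z: points (rows), y: labels in {-1, +1}.
    # Update on any point classified with confidence at most gamma.
    w = np.zeros(Z.shape[1])
    updates = 0
    while updates < max_updates:
        violated = np.where(y * (Z @ w) <= gamma)[0]
        if violated.size == 0:
            break
        i = violated[0]
        w = w + eta * y[i] * Z[i]
        updates += 1
    return w, updates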

Novikoff’s proof can be generalized to any confidence parameter γ ≥ 0 and to batches of any size to yield the following theorem; see [Collobert and Bengio(2004), Krauth and Mezard(1987)] and Appendix A.

Theorem 2.

For a set with margin γ* and step size η, the modified perceptron algorithm performs at most (2ηγ + η²R²)/(ηγ*)² updates and achieves a margin of at least γγ*/(2γ + ηR²), where R is the maximal norm of a point in the set.

So, when the weights of the hidden layer are fixed, Lemma 3 implies that the number of SGD steps is at most polynomial in n.

3.4. Stability

When we run SGD on the entire network, the layers interact. For a network at time t, the update rule is as follows. If the network classifies the input correctly with confidence more than γ, no change is made. Otherwise, we change the weights of the output layer by η y h(x), where y is the true label and η is the step size. If, in addition, neuron j of the hidden layer fired on x, we update its incoming weights by η y u_j x and its bias by η y u_j. These update rules define the following dynamical system:

(1) W_{t+1} = W_t + η y (u_t ⊙ θ(W_t x + b_t)) xᵀ
(2) b_{t+1} = b_t + η y (u_t ⊙ θ(W_t x + b_t))
(3) u_{t+1} = u_t + η y ReLU(W_t x + b_t)
(4) c_{t+1} = c_t + η y

where θ is the Heaviside step function and ⊙ is the Hadamard pointwise product.
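For concreteness, here is a sketch of one SGD step consistent with the verbal description above and with our rendering of equations (1)-(4); the function and variable names are ours, and the exact form used in the paper may differ.

import numpy as np

def sgd_step(x, y, W, b, u, c, eta, gamma):
    # One SGD step on the hinge loss with confidence gamma for a
    # one-hidden-layer ReLU network (our rendering of the update rule).
    h = np.maximum(W @ x + b, 0.0)           # hidden activations
    fired = (W @ x + b > 0).astype(float)    # Heaviside gate per neuron
    if y * (u @ h + c) > gamma:
        return W, b, u, c                    # confident and correct: no update
    # Output layer: the same update as the modified perceptron on the embedding h.
    u_new = u + eta * y * h
    c_new = c + eta * y
    # Hidden layer: only neurons that fired move, scaled by their outgoing weight.
    g = eta * y * (u * fired)                # Hadamard product u ⊙ θ(Wx + b)
    W_new = W + np.outer(g, x)
    b_new = b + g
    return W_new, b_new, u_new, c_new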

A key observation in the proof is that the weights of the last layer ((3) and (4)) are updated exactly as in the modified perceptron algorithm. Another key statement in the proof is that if the network has reached a good representation of the input (i.e., the hidden layer has a large margin), then the interaction between the layers during the continued training does not impair this representation. This is summarized in the following lemma (we are not aware of a similar statement in the literature).

Lemma 4.

Let (H(X), y) be a linearly separable embedding of X = {0,1}^n with margin γ* by the hidden layer of a neural network of depth two with ReLU activation and weights W, b, u, c. Let η be the step size and let γ be the confidence parameter used in the loss function. If η is small enough (polynomially small in n) and γ is chosen accordingly, then after the SGD iterations the following hold:

  • Each embedded element moves a distance of at most a small fraction of γ*.

  • The norm of the output layer remains bounded.

  • The training ends after at most polynomially many SGD updates.

Intuitively, this type of lemma can be useful in many other contexts. The high level idea is to identify a “good geometric structure” that the network reaches and enables efficient learning.

Proof.

We are interested in the maximal distance the embedding of an element has moved from its initial embedding:

(5)
(6)
(7)

To simplify equations (1)-(4) discussed above, we assume that during the optimization process the norm of the weights and grow at a maximal rate:

(8)
(9)

here the norm of a matrix is the -norm.

To bound these quantities, we follow the modified perceptron proof and add another quantity to bound. That is, the maximal norm of the embedded space at time satisfies (by assumption )

we used that the spectral norm of a matrix is at most its -norm.

We assume a worst-case where grows monotonically at a maximal rate. By the modified perceptron algorithm and choice ,

By choice of and assuming ,

Solving the above recursive equation, it holds for all ,

Now, summing equation 7, we have

since .

So in updates, the elements embedded by the network travelled at most . Hence, the samples the network received kept a margin of during training (by the assumption ). By choice of the loss function, SGD changes the output neuron as in the modified perceptron algorithm. By Theorem 2, the number of updates is at most . So, the assumption on we made during the proof holds.

3.5. Main Result

Proof of Theorem 1.

There is an unknown distribution D over the space {0,1}^n. We pick i.i.d. examples x_1, …, x_m according to D, labelled by the target f, with m as in the theorem statement. We run SGD for the prescribed number of steps, with a step size η that is polynomially small in n and a confidence parameter γ chosen as in Lemma 4.

We claim that it suffices to show that at the end of the training (i) the network correctly classifies all the sample points, and (ii) for every x for which there exists a sample point x' of the same Hamming weight, the network gives x the same output as x'. Here is why. The initialization of the network embeds the space {0,1}^n into a space of dimension O(n) (including the bias neuron of the hidden layer). Let H_0 be the initial embedding of {0,1}^n. Although the cube has 2^n points, the size of H_0 is only n + 1, since the initial embedding depends only on the Hamming weight. The VC dimension of the class of all boolean functions over a set of size n + 1 is n + 1. Now, a sample of size linear in the VC dimension (and in 1/ε) suffices to yield true error ε for an ERM; see e.g. Theorem 6.7 in [Shalev-Shwartz and Ben-David(2014)]. It remains to prove (i) and (ii) above.

By Lemma 3, at the beginning of the training, the partition of the embedded cube defined by the target f has a margin that is at least inverse polynomial in n. We are interested in the eventual embedding as well. The modified perceptron algorithm guarantees that after its updates, the output layer (u, c) separates the embedded sample with a margin that is a constant fraction of the initial one. This happens as long as the updates we perform come from a set with bounded maximal norm and with a margin close to the initial one. This is guaranteed by Lemma 4, and concludes the proof of (i).

It remains to prove (ii). Lemma 4 states that throughout the training the embedded elements move only a small distance, a small fraction of the margin. At the end of the training, the embedded sample is separated with a large margin with respect to the hyperplane defined by the output layer (u, c). Initially, any x and x' of equal Hamming weight have the same embedding, and each of them moved by at most a small fraction of this margin, so they end up on the same side of the separating hyperplane. This means that the network has the same output on x and x'. Since the network has zero empirical error, the output on such an x is f(x) as well. ∎

A similar proof is available for the sigmoid activation (with a better convergence rate and a larger allowed step size).

Remark.

The generalization part of the above proof can be viewed as a consequence of sample compression ([Littlestone and Warmuth(1986)]). Although the eventual network depends on all the examples, the proof shows that its functionality depends on at most n + 1 of them. Indeed, after the training, all examples with equal Hamming weight have the same label.

Remark.

The parameter γ we chose in the proof may seem odd and negligible. It is a construct in the proof that allows us to efficiently bound the distance that the elements of the embedded cube have moved during the training. For all practical purposes, γ = 0 works as well (see Figure 4).

4. Experiments

We accompany the theoretical results with some experiments. We used a network with one hidden layer of O(n) neurons and the hinge loss with confidence parameter γ. In all the experiments, we used SGD with mini-batches of size one, and before each epoch we reshuffled the sample. The graphs present the training error and the true error² versus the epoch of the training process. In all the comparisons below, we chose a random symmetric function and a random sample from {0,1}^n.

²We deal with high-dimensional spaces, so the true error was not calculated exactly but approximated on an independent batch of samples.
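A sketch of this experimental protocol (our code; the dimension, sample sizes, and seed are placeholders):

import numpy as np

rng = np.random.default_rng(0)
n, sample_size, test_size = 30, 2000, 10000   # placeholder sizes

# A random symmetric function: an independent random sign per Hamming weight.
g = rng.choice([-1.0, 1.0], size=n + 1)
f = lambda x: g[int(np.sum(x))]

# A random training sample, and an independent batch used to approximate
# the true error.
X = rng.integers(0, 2, size=(sample_size, n)).astype(float)
Y = np.array([f(x) for x in X])
X_test = rng.integers(0, 2, size=(test_size, n)).astype(float)
Y_test = np.array([f(x) for x in X_test])

def error(predict, A, B):
    # Fraction of points on which the predictor disagrees with the labels.
    return float(np.mean([predict(a) != b for a, b in zip(A, B)]))

# Training: SGD with mini-batches of size one, reshuffling the sample before
# each epoch (e.g. with the sgd_step sketch from Section 3.4); after each
# epoch, report error(predict, X, Y) and error(predict, X_test, Y_test).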

Figure 2. Error during training for two input dimensions (panels (a) and (b)) and a fixed sample size.
Figure 3. Error during training for three different step sizes (panels (a)-(c)), with a fixed input dimension and sample size.
Figure 4. Setting γ = 0: error during training for a fixed input dimension and sample size.
Figure 5. Parity: error during training for four sample sizes (panels (a)-(d)), with a fixed input dimension.
Figure 6. Random symmetric function: error during training for four sample sizes (panels (a)-(d)), with a fixed input dimension.

4.1. The Theory in Practice

Figure 2 demonstrates our theoretical results and also validates the performance of our initialization. In one setting, we trained only the second layer (i.e., we froze the weights of the hidden layer), which essentially corresponds to the perceptron algorithm. In the second setting, we trained both layers with a small step size (as the theory suggests). As expected, the performance in both cases is similar. We remark that SGD continues to make updates even after the empirical error reaches zero. This happens because of the confidence parameter γ.

4.2. Overstepping the Theory

Here we experiment with two parameters in the proof, the step size η and the confidence parameter γ.

In Figure 3, we used three different step sizes, two of which are much larger than the theory allows. We see that the training error converges to zero much faster when the step size is larger. This fast convergence, however, comes at the expense of the true error: for a large step size, generalization ceases to hold.

Setting γ > 0 is a construct in the proof. Figure 4 shows that setting γ = 0 does not impair the performance. The difference between theory (which requires γ > 0) and practice (which allows γ = 0) can be explained as follows. The proof bounds the worst-case movement of the hidden layer, whereas in practice an average-case argument suffices.

4.3. Hard to Learn Parity

Figure 5 shows that, even for the input dimension considered here, learning parity is hard from a random initialization. When the sample size is small, the training error can be nullified, but the true error is large. As the sample grows, it becomes much harder for the network to nullify even the training error. With our initialization, both the training error and the true error are minimized quickly. Figure 6 demonstrates the same phenomenon for a random symmetric function.

4.4. Corruption of Data

Our initialization also delivers satisfying results when the input data is corrupted. In Figure 7, we randomly perturb the labels (each flipped with some fixed probability) and use the same SGD to train the model. In Figure 8, we randomly shift every entry of the vectors in the space by a quantity that is uniformly distributed in a small interval.

Figure 7. Label-error resistance: labels of the sample were flipped with a fixed probability.

Figure 8. Input-error resistance: all the entries of the vectors in the space were randomly shifted.
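A sketch of the two corruption models (our code; the flip probability and the noise range are placeholders):

import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(Y, p=0.1):
    # Flip each label independently with probability p (placeholder value).
    flips = rng.random(len(Y)) < p
    return np.where(flips, -Y, Y)

def corrupt_inputs(X, delta=0.1):
    # Shift every entry independently by uniform noise in [-delta, delta].
    return X + rng.uniform(-delta, delta, size=X.shape)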

5. Conclusion

This work demonstrates that symmetries can play a critical role when designing a neural network. We proved that any symmetric function can be learned by a shallow neural network with a proper initialization. We demonstrated by simulations that this neural network is stable under corruption of data, and that the small step size in the proof is necessary.

We also demonstrated that the parity function or a random symmetric function cannot be learned with a random initialization. How to explain this empirical phenomenon is still an open question. The works [Shamir(2016)] and [Song et al.(2017)] treated parities using the language of SQ. This language obscures the inner mechanism of the network's training, so a more concrete explanation is currently missing.

We proved, in a special case, that the standard SGD training of a network efficiently produces low true error. The general problem that remains is to prove similar results for general neural networks. A suggestion for future work is to try to identify favorable geometric states of the network that guarantee fast convergence and generalization.

Acknowledgements

We wish to thank Adam Klivans for helpful comments.

References

Appendix A The Modified Perceptron

Proof of Theorem 2.

Denote by w* the optimal separating hyperplane with ||w*|| = 1. It satisfies y⟨w*, x⟩ ≥ γ* for all (x, y) in the set. By the definition of the update rule, whenever an update is made on an example (x, y),

⟨w_{t+1}, w*⟩ = ⟨w_t, w*⟩ + η y ⟨x, w*⟩ ≥ ⟨w_t, w*⟩ + η γ*,

and

||w_{t+1}||² = ||w_t||² + 2 η y ⟨w_t, x⟩ + η² ||x||² ≤ ||w_t||² + 2ηγ + η²R².

Hence, after T updates, ⟨w_T, w*⟩ ≥ Tηγ* and ||w_T||² ≤ T(2ηγ + η²R²). By the Cauchy-Schwarz inequality, Tηγ* ≤ ||w_T||. So the number of updates is bounded by

T ≤ (2ηγ + η²R²) / (ηγ*)².

At time T, the margin of any x that does not require an update is at least

γ / ||w_T|| ≥ γ / √(T(2ηγ + η²R²)).

The right-hand side is a monotonically decreasing function of T, so by plugging in the maximal number of updates we see that the minimal margin of the output is at least

γγ* / (2γ + ηR²). ∎