On the distance between two neural networks and the stability of learning

02/09/2020 · Jeremy Bernstein et al.

How far apart are two neural networks? This is a foundational question in their theory. We derive a simple and tractable bound that relates distance in function space to distance in parameter space for a broad class of nonlinear compositional functions. The bound distills a clear dependence on depth of the composition. The theory is of practical relevance since it establishes a trust region for first-order optimisation. In turn, this suggests an optimiser that we call Frobenius matched gradient descent—or Fromage. Fromage involves a principled form of gradient rescaling and enjoys guarantees on stability of both the spectra and Frobenius norms of the weights. We find that the new algorithm increases the depth at which a multilayer perceptron may be trained as compared to Adam and SGD and is competitive with Adam for training generative adversarial networks. We further verify that Fromage scales up to a language transformer with over 10^8 parameters. Please find code reproducibility instructions at: https://github.com/jxbz/fromage.


1 Introduction

Suppose that a teacher wishes to assess a student’s learning. Traditionally, they will assign the student homework and track their progress. What if, instead, they could peer inside the student’s head and observe change directly in the synapses—would that not be better for everyone?

Figure 1: How far can we trust local knowledge about a deep compositional function such as a deep neural network? Given the value and gradient at the marked point, the shaded regions represent the set of consistent functions under two different models of trust. The dashed lines represent some example functions. Our model (right) is well-suited for deep neural networks and predicts a catastrophic loss of trust—beyond a threshold, all bets are off.

Neural networks are usually trained by (stochastic) gradient descent. The basic premise is that gradient descent solves:

$$\Delta W \;=\; \operatorname*{arg\,min}_{\Delta W}\;\Big[\,\nabla_W\mathcal{L}(W)^\top \Delta W \;+\; \mathrm{penalty}(\Delta W)\,\Big].$$

That is, gradient descent chooses the parameter perturbation $\Delta W$ to minimise a local linear approximation to the objective function $\mathcal{L}$, where we add the penalty to prevent $\Delta W$ from straying beyond the region where the gradient is trusted (Nocedal and Wright, 2006). For gradient descent, the penalty takes the form:

Definition 1 (Euclidean trust).

$$\Delta W \;\mapsto\; \frac{\lVert \Delta W \rVert_2^2}{2\eta} \qquad \text{for a learning rate } \eta > 0.$$

We refer to this model as Euclidean trust since a quadratic penalty is akin to assuming a Euclidean structure on the parameter space. We perform a theoretical analysis and experimental study to test this model and find evidence that, for multilayer perceptrons, trust is lost not quadratically but quasi-exponentially in the perturbation size. Figure 1 illustrates the difference.
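To see why this penalty deserves the name, note that minimising the penalised linear model in closed form recovers the familiar gradient descent step (a one-line computation, assuming the quadratic penalty written above):

$$\frac{\partial}{\partial \Delta W}\left[\nabla\mathcal{L}(W)^\top \Delta W + \frac{\lVert \Delta W\rVert_2^2}{2\eta}\right] = 0 \quad\Longrightarrow\quad \Delta W = -\,\eta\,\nabla\mathcal{L}(W).$$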

Our analysis exposes the following mathematical structure for the trust region of a broad family of deep neural networks with layers indexed $l = 1, \dots, L$:

Definition 2 (Deep relative trust).

$$\big(\Delta W_1, \dots, \Delta W_L\big) \;\mapsto\; \prod_{l=1}^{L}\left(1 + \frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F}\right) - 1.$$

Deep relative trust has two essential features: the first is a dependence on the relative magnitude of perturbations; the second is a product over the network's layers, reflecting the product structure of the network itself. These features are both absent from Euclidean trust. In our model, relative perturbations across layers compound.
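As a concrete illustration, the quantity in Definition 2 is cheap to evaluate for any pair of networks. A minimal PyTorch sketch, assuming the product form given above (our own illustrative code, not from the paper):

```python
import torch

def deep_relative_trust(weights, perturbed_weights):
    """Distance of Definition 2: prod_l (1 + ||dW_l||_F / ||W_l||_F) - 1."""
    distance = 1.0
    for W, W_tilde in zip(weights, perturbed_weights):
        distance = distance * (1.0 + (W_tilde - W).norm() / W.norm())
    return distance - 1.0

def perturb(W, rel=0.01):
    # Add a random perturbation of relative Frobenius size `rel`.
    dW = torch.randn_like(W)
    return W + rel * W.norm() * dW / dW.norm()

# Example: a 3-layer stack of weights perturbed by 1% per layer.
Ws = [torch.randn(64, 64) for _ in range(3)]
Ws_tilde = [perturb(W) for W in Ws]
print(deep_relative_trust(Ws, Ws_tilde))   # ≈ (1 + 0.01)^3 - 1 ≈ 0.0303
```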

The main contributions of this paper are:

  1. proposing that deep relative trust is an appropriate notion of distance between neural networks based on both theoretical analysis and experimental evidence.

  2. developing an optimisation theory based on deep relative trust, and using the tools of matrix perturbation theory to study the stability of learning.

  3. deriving a neural network optimiser called Fromage (Algorithm 1) that exploits the new theory. The algorithm has

    one hyperparameter

    with a clear meaning.

  4. benchmarking Fromage on popular machine learning problems such as image classification, generative adversarial networks and natural language transformers, revealing often favourable performance compared to standard optimisers such as Adam and SGD.

2 Entendámonos…

…so we understand each other.

The goal of this section is to review a few basics of deep learning, including heuristics commonly used in algorithm design and areas where current optimisation theory falls short. We shall also review generative adversarial learning.

We shall see that, whilst it is central to both optimisation and generative adversarial learning, finding an appropriate notion of functional distance for deep networks is not a solved problem.

Deep learning basics

Deep learning seeks to fit a neural network function $f(x; W)$ with parameters $W$ to a dataset of input-output pairs $(x_1, y_1), \dots, (x_N, y_N)$. If we let $\ell\big(f(x; W), y\big)$ measure the discrepancy between prediction $f(x; W)$ and target $y$, then learning proceeds by gradient descent on the loss: $\mathcal{L}(W) := \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i; W), y_i\big)$.

Though various neural network architectures exist, we shall focus our theoretical effort on the multilayer perceptron, which already contains the most striking features of general neural networks: matrices, nonlinearities, and layers.

Definition 3 (Multilayer perceptron).

A multilayer perceptron $f$ is a function composed of $L$ layers.

The $l$th layer is a linear map $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ followed by a nonlinearity $\phi$ that is applied elementwise.

The multilayer perceptron may be described recursively in terms of the $l$th hidden layer $h_l$ as:

$$h_0 := x, \qquad h_l := \phi(W_l\, h_{l-1}) \quad \text{for } l = 1, \dots, L, \qquad f(x) := h_L.$$
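For concreteness, Definition 3 corresponds to the following PyTorch module (an illustrative sketch; the widths and the leaky-relu slope are placeholder choices, not values from the paper):

```python
import torch
from torch import nn

class MLP(nn.Module):
    """Multilayer perceptron of Definition 3: h_l = phi(W_l h_{l-1}), h_0 = x."""

    def __init__(self, widths=(784, 256, 256, 10), alpha=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out, bias=False)
             for d_in, d_out in zip(widths[:-1], widths[1:])]
        )
        self.phi = nn.LeakyReLU(alpha)   # elementwise nonlinearity

    def forward(self, x):
        h = x
        for layer in self.layers:
            h = self.phi(layer(h))       # h_l = phi(W_l h_{l-1})
        return h

f = MLP()
print(f(torch.randn(32, 784)).shape)     # torch.Size([32, 10])
```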

Since we wish to fit the network via gradient descent, we shall be interested in the gradient of the loss with respect to the $l$th parameter matrix. Schematically, via the chain rule:

$$\nabla_{W_l}\mathcal{L} \;=\; \frac{\partial \mathcal{L}}{\partial f(x)} \cdot \frac{\partial f(x)}{\partial h_l} \cdot \frac{\partial h_l}{\partial W_l}. \qquad (1)$$

Let us zoom in on the second term on the right-hand side, following the treatment of Pennington et al. (2017).

  Input: learning rate $\eta > 0$ and weight matrices $W_1, \dots, W_L$
  repeat
     collect gradients $\nabla_{W_1}\mathcal{L}, \dots, \nabla_{W_L}\mathcal{L}$ via backpropagation
     for layer $l = 1$ to $L$ do
        $W_l \gets \dfrac{1}{\sqrt{1+\eta^2}}\left(W_l - \eta\,\lVert W_l\rVert_F\, \dfrac{\nabla_{W_l}\mathcal{L}}{\lVert \nabla_{W_l}\mathcal{L}\rVert_F}\right)$
     end for
  until converged
Algorithm 1 Fromage (a good default for $\eta$ is $0.01$).
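For concreteness, here is a minimal PyTorch sketch of Algorithm 1. It is an illustrative re-implementation written for this text rather than the authors' official optimiser (which lives in the linked repository); the fallback for zero-norm tensors is our own choice.

```python
import math
import torch

class Fromage(torch.optim.Optimizer):
    """Sketch of Algorithm 1: W <- (W - lr * ||W||_F * g / ||g||_F) / sqrt(1 + lr^2)."""

    def __init__(self, params, lr=0.01):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr = group['lr']
            for p in group['params']:
                if p.grad is None:
                    continue
                w_norm, g_norm = p.norm(), p.grad.norm()
                if w_norm > 0 and g_norm > 0:
                    # Rescale the gradient so the step has relative size lr.
                    p.add_(p.grad, alpha=-lr * float(w_norm / g_norm))
                else:
                    # Fallback (our choice): plain gradient step for zero-norm tensors.
                    p.add_(p.grad, alpha=-lr)
                # Correct the sqrt(1 + lr^2) growth in the Frobenius norm.
                p.div_(math.sqrt(1 + lr ** 2))
```

Usage mirrors any other optimiser: `opt = Fromage(model.parameters(), lr=0.01)`, followed by the usual `loss.backward(); opt.step(); opt.zero_grad()` loop.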
Proposition 1 (Jacobian of the multilayer perceptron).

Consider a multilayer perceptron with $L$ layers. For a layer $l < L$, the layer-$l$-to-output Jacobian is given by:

$$\frac{\partial f(x)}{\partial h_l} \;=\; \Phi'_L W_L \;\Phi'_{L-1} W_{L-1} \cdots \Phi'_{l+1} W_{l+1},$$

where $\Phi'_k := \mathrm{diag}\big(\phi'(W_k h_{k-1})\big)$.

A key observation is that the network function and Jacobian share a common mathematical structure—a deep, layered composition. We shall exploit this in our theory.

Empirical deep learning

For the $l$th layer, gradient descent prescribes the update:

$$\Delta W_l \;=\; -\,\eta\, \nabla_{W_l}\mathcal{L}, \qquad (2)$$

where $\eta$ is a small perturbation parameter, or learning rate, chosen independently of the layer.

Practitioners quickly run into a problem with this formulation known as the vanishing and exploding gradient problem, where the scale of updates becomes miscalibrated with the scale of parameters in different layers of the network. Common tricks to ameliorate the problem include careful choice of weight initialisation (Glorot and Bengio, 2010), dividing out the gradient scale (Kingma and Ba, 2015) and gradient clipping (Pascanu et al., 2013). Each of these techniques has been adopted in numerous deep learning applications.
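For concreteness, here is how the three heuristics typically appear in PyTorch (an illustrative snippet using standard library calls, not code from the paper):

```python
import torch
from torch import nn

layer = nn.Linear(256, 256)

# Careful weight initialisation (Glorot and Bengio, 2010).
nn.init.xavier_uniform_(layer.weight)

# Dividing out the gradient scale (Kingma and Ba, 2015).
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)

# Gradient clipping (Pascanu et al., 2013): cap the global gradient norm.
loss = layer(torch.randn(32, 256)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
optimizer.step()
```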

Still, there is a cost to using heuristic techniques. For instance, techniques that rely on careful initialisation may break down by the end of training, leading to instabilities that are difficult to trace. Gradient clipping involves introducing and tuning a new parameter: the clipping threshold.

Related work in deep learning optimisation theory

Euclidean trust, as set up in the introduction, is commonly justified by assuming that the loss function has Lipschitz continuous gradients, meaning that for some constant $\beta > 0$ and all $W, \Delta W$:

$$\lVert \nabla\mathcal{L}(W + \Delta W) - \nabla\mathcal{L}(W) \rVert_2 \;\leq\; \beta\,\lVert \Delta W \rVert_2.$$

By a standard argument (Bottou et al., 2016), this implies a quadratic or Euclidean upper bound on the loss function:

$$\mathcal{L}(W + \Delta W) \;\leq\; \mathcal{L}(W) + \nabla\mathcal{L}(W)^\top \Delta W + \frac{\beta}{2}\lVert \Delta W \rVert_2^2.$$

Gradient descent as in (2) iteratively minimises this bound.
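Spelling this out with a standard one-line computation: substituting the step (2) into the quadratic bound gives

$$\mathcal{L}(W - \eta\,\nabla\mathcal{L}(W)) \;\leq\; \mathcal{L}(W) - \eta\left(1 - \tfrac{\beta\eta}{2}\right)\lVert \nabla\mathcal{L}(W)\rVert_2^2,$$

so the loss is guaranteed to decrease whenever $\eta < 2/\beta$.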

The gradient-Lipschitz assumption is ubiquitous to the point that it is often just referred to as smoothness (Hardt et al., 2016). The assumption is a natural starting point for theory and it is used by: Hardt et al. (2016), Lee et al. (2016), Du et al. (2017) and Allen-Zhu (2018) in the context of deep learning optimisation; Bernstein et al. (2018) in the context of distributed training; Schaefer and Anandkumar (2019) in the context of generative adversarial networks.

The Lipschitz assumption played a central role in classical optimisation (Nesterov, 2014, Chapter 1). However, it is unclear how applicable the assumption is to deep learning—in a comprehensive review of deep learning optimisation, Sun (2019) writes that “neural network optimization problems do not have a global gradient Lipschitz constant” and that “the lack of global Lipschitz constants is a general challenge for non-linear optimization”.

The surest way to see that neural networks are not gradient-Lipschitz for all practical purposes is to measure the gradient empirically. We do this for a 16-layer multilayer perceptron and find that the gradient grows roughly exponentially in the size of a perturbation (Figure 2). In a similar vein, Benjamin et al. (2019) empirically test the use of Euclidean distance as a proxy for functional distance, and find the relationship non-trivial and difficult to interpret.

Several classical optimisation frameworks study non-Euclidean models of functional distance. For example, mirror descent (Nemirovsky and Yudin, 1983) replaces the Euclidean penalty by a Bregman divergence appropriate to the geometry of the problem. This framework was studied in relation to deep learning (Azizan and Hassibi, 2019; Azizan et al., 2019), but the design of good divergence measures remains an area of active research.

Another classical technique is natural gradient descent (Amari, 2016), which replaces the Euclidean penalty $\lVert \Delta W \rVert_2^2$ by the quadratic form $\Delta W^\top F(W)\, \Delta W$ for a Riemannian metric $F(W)$. The metric should capture the geometry of the function class. Unfortunately, this technique is computationally heavy since just writing down the metric takes $O(n^2)$ space in the number of parameters $n$, and for neural networks $n$ is routinely in the millions. Whilst Martens and Grosse (2015) explore more efficient surrogates, natural gradient descent is fundamentally a quadratic model of trust. Our results suggest that trust is lost far more catastrophically in deep networks (Figure 1).

A final line of related work studies the effect of architectural decisions on signal propagation through the network (Saxe et al., 2014; Pennington et al., 2017; Yang and Schoenholz, 2017; Xiao et al., 2018; Anil et al., 2019), which inspired aspects of our work. Though these works neglect theoretical study of functional distance and curvature of the loss surface, they do carry out direct analyses of the deep neural network structure. Pennington and Bahri (2017), on the other hand, do study curvature of the loss surface, though they rely on random matrix models to make progress.

Generative adversarial networks

Neural networks can learn to generate samples from complex distributions. Generative adversarial learning (Goodfellow et al., 2014) trains a discriminator network $D$ to classify data as real or fake, and a generator network $G$ to fool $D$. Competition drives learning in both networks. Letting $V(D, G)$ denote the success rate of the discriminator, the learning process is described as:

$$\min_G \max_D \; V(D, G).$$

Defining the optimal discriminator for a given generator $G$ as $D^\star(G) := \arg\max_D V(D, G)$, generative adversarial learning reduces to a straightforward minimisation over the parameters of the generator:

$$\min_G \; V\big(D^\star(G),\, G\big).$$

In practice this is solved as an inner-loop, outer-loop optimisation procedure where $n$ steps of gradient descent are performed on the discriminator, followed by one step on the generator. For example, Miyato et al. (2018) take $n = 5$ and Brock et al. (2019) take $n = 2$.
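A self-contained toy sketch of this inner-loop, outer-loop procedure (the tiny networks, the non-saturating logistic loss and the random stand-in data are our own illustrative choices, not the paper's setup):

```python
import torch
from torch import nn
import torch.nn.functional as F

z_dim, x_dim, n_dis = 16, 32, 5                  # n_dis discriminator steps per generator step
G = nn.Sequential(nn.Linear(z_dim, x_dim))       # toy generator
D = nn.Sequential(nn.Linear(x_dim, 1))           # toy discriminator
g_opt = torch.optim.SGD(G.parameters(), lr=1e-2)
d_opt = torch.optim.SGD(D.parameters(), lr=1e-2)

for step in range(100):
    # Inner loop: n_dis discriminator updates on real and generated data.
    for _ in range(n_dis):
        real = torch.randn(64, x_dim)            # stand-in for real data
        fake = G(torch.randn(64, z_dim)).detach()
        d_loss = F.softplus(-D(real)).mean() + F.softplus(D(fake)).mean()
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Outer loop: one generator update to fool the discriminator.
    fake = G(torch.randn(64, z_dim))
    g_loss = F.softplus(-D(fake)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```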

For small $n$, this procedure is only well founded if the perturbation $\Delta G$ to the generator is small enough to induce only a small perturbation in the optimal discriminator. In symbols, we hope that $D^\star(G + \Delta G) \approx D^\star(G)$.

But what does $\approx$ mean? In what sense should $\Delta G$ be small? Again, we realise that we are lacking an appropriate notion of functional distance for neural networks.

3 The distance between neural networks

We would like to establish a meaningful notion of functional distance for neural networks. The main pitfall of the Euclidean distance on parameters is that it does not reflect the product structure of the network.

To guide intuition, consider a simple network that multiplies its input by two scalars $a$ and $b$. That is, $f(x) := a\,b\,x$. Also consider the perturbed function $\tilde f(x) := (a + \Delta a)(b + \Delta b)\,x$, where $\Delta a$ and $\Delta b$ are perturbations. By expanding the square and bounding the cross-terms with Young's inequality, we find that the relative difference obeys:

$$\frac{|\tilde f(x) - f(x)|}{|f(x)|} \;\leq\; \left(1 + \frac{|\Delta a|}{|a|}\right)\left(1 + \frac{|\Delta b|}{|b|}\right) - 1.$$
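For completeness, here is the short computation behind the two-scalar bound, using the triangle inequality in place of the Young's-inequality route taken in the appendix:

$$\frac{|\tilde f(x) - f(x)|}{|f(x)|}
= \frac{|(a+\Delta a)(b+\Delta b) - ab|}{|ab|}
= \left|\frac{\Delta a}{a} + \frac{\Delta b}{b} + \frac{\Delta a\,\Delta b}{ab}\right|
\;\leq\; \left(1 + \frac{|\Delta a|}{|a|}\right)\left(1 + \frac{|\Delta b|}{|b|}\right) - 1.$$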

We flesh out this important derivation in the appendix. The following theorem, also proved in the appendix, generalises this argument to the deep, nonlinear case.

Theorem 1 (Relative functional difference).

Let $f$ be a multilayer perceptron with nonlinearity $\phi$ and weight matrices $W_1, \dots, W_L$. Likewise consider the perturbed network $\tilde f$ with weight matrices $\widetilde W_1, \dots, \widetilde W_L$. For convenience, we define perturbation matrices $\Delta W_l := \widetilde W_l - W_l$.

Let the dimension of the $l$th hidden layer be $d_l$, meaning that $h_l \in \mathbb{R}^{d_l}$. We define the maximum width $d := \max_l d_l$.

Suppose that the following conditions hold:

  1. Fixed point. The nonlinearity satisfies $\phi(0) = 0$.

  2. Transmission. There exist $a, b > 0$ such that $a \leq |\phi'(x)| \leq b$ for all $x \in \mathbb{R}$ at which the derivative exists.

  3. Conditioning. Each of the unperturbed weight matrices $W_l$ has condition number bounded by $\kappa$.

For all non-zero inputs $x$ we have:

$$\frac{\lVert \tilde f(x) - f(x) \rVert_2}{\lVert f(x) \rVert_2} \;\leq\; \prod_{l=1}^{L}\left(1 + C\,\frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F}\right) - 1,$$

where we have defined $C$ to be a constant depending only on the transmission constants $a$ and $b$, the condition number bound $\kappa$ and the maximum width $d$.

In words, Theorem 1 says that the change of a multilayer perceptron in function space is controlled by deep relative trust (Definition 2). As deep relative trust goes to zero, the relative change in function space goes to zero too.

Bounding the relative change in function in terms of the relative change in parameters is reminiscent of a concept from numerical analysis known as the relative condition number. The relative condition number of a numerical technique measures the sensitivity of the technique to input perturbations. This suggests that we may think of Theorem 1 as defining the relative condition number of a neural network with respect to parameter perturbations.
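For reference, the classical definition of the relative condition number of a map $g$ at an input $x$ reads (standard numerical analysis, included here for context):

$$\kappa_{\mathrm{rel}}(g, x) \;:=\; \lim_{\varepsilon \to 0}\; \sup_{\lVert \Delta x \rVert \leq \varepsilon \lVert x \rVert} \frac{\lVert g(x + \Delta x) - g(x) \rVert / \lVert g(x) \rVert}{\lVert \Delta x \rVert / \lVert x \rVert}.$$

Theorem 1 plays the analogous role with the parameters $W_1, \dots, W_L$ in place of the input $x$.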

We must discuss the plausibility of the assumptions. The first two conditions are on the nonlinearity and are both satisfied by the “leaky relu” function, which for $\alpha \in (0, 1]$ is defined by:

$$\phi(x) := \max(x, \alpha x).$$

Setting $\alpha = 0$ yields the “relu” function, which only satisfies the second condition with $a = 0$, for which the bound diverges. We may suspect that for inputs that occur in practice, the second assumption may effectively hold for relu with an $a > 0$. We leave detailed investigation for future work.
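Under the transmission condition as reconstructed above (a bound on the derivative of the nonlinearity), the check for leaky relu is immediate:

$$\phi'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x < 0 \end{cases}
\quad\Longrightarrow\quad a = \alpha,\;\; b = 1, \qquad \text{and} \qquad \phi(0) = 0.$$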

As for the third condition, in general $\kappa$ may be infinite—rendering the bound vacuous. However, we know by smoothed analysis of the condition number (Sankar et al., 2006; Bürgisser and Cucker, 2010) that $\kappa$ is finite with probability one for an iid Gaussian initialisation, and it continues to be so throughout training provided a small amount of iid Gaussian noise is added to the updates.

4 Breakdown of a local linear approximation

In the last section we studied the relative functional difference between two neural networks and found that it depends on deep relative trust. Here we will focus on the relative difference in gradient, so that we may establish a trust region for optimisation. We shall see that the relative functional difference and relative gradient difference are connected.

We are interested in the relative change in the gradient expression (1). Tackling the product of the three terms on the right-hand side directly is challenging, not least because the loss function is unknown and arbitrary. As a result, we will tackle each term individually.

We will argue that both the first term and the third term depend on the output of a hidden layer, and since a hidden layer is itself the last layer of a sub-network, these terms are connected to deep relative trust via Theorem 1.

To realise this argument, observe that the first term depends on the network output $f(x)$. For example, for the squared error loss $\ell(f(x), y) = \tfrac{1}{2}\lVert f(x) - y \rVert_2^2$ we have $\partial \ell / \partial f(x) = f(x) - y$. Similarly, the third term depends on the output of layer $l - 1$. To see this, note that $h_l = \phi(W_l h_{l-1})$ and therefore, schematically, $\partial h_l / \partial W_l \sim \phi'(W_l h_{l-1})\cdot h_{l-1}^\top$.

The final term to tackle is the middle term in (1): the layer-$l$-to-output Jacobian $\partial f(x)/\partial h_l$. As detailed in Proposition 1, it is a product of matrices. We proffer the following theorem to bound its relative change:

Figure 2: Using Fromage, we train a 2-layer (left) and a 16-layer (right) perceptron to classify the MNIST dataset. With the network frozen at ten different training checkpoints, we first compute the gradient of every layer using the full data batch. We then record the loss and the full-batch gradient after perturbing all weight matrices along their gradient directions, for a range of perturbation strengths. We plot the relative change in gradient of the input layer and also the classification loss along these parameter slices. Note that these plots are on a log scale. We find that the loss and the relative change in gradient grow quasi-exponentially when the perceptron is deep, suggesting that Euclidean trust is violated. As such, these results seem more consistent with deep relative trust.
Theorem 2 (Relative Jacobian difference).

Let $f$ be a multilayer perceptron with nonlinearity $\phi$ and weight matrices $W_1, \dots, W_L$. Likewise consider the perturbed network $\tilde f$ with weight matrices $\widetilde W_1, \dots, \widetilde W_L$. For convenience, we define perturbation matrices $\Delta W_l := \widetilde W_l - W_l$.

Let the dimension of the $l$th hidden layer be $d_l$, so that $h_l \in \mathbb{R}^{d_l}$. We define the maximum width $d := \max_l d_l$.

Suppose that the following conditions hold:

  1. Transmission. There exist $a, b > 0$ such that $a \leq |\phi'(x)| \leq b$ for all $x \in \mathbb{R}$ at which the derivative exists.

  2. Conditioning. Each of the unperturbed weight matrices $W_l$ has condition number bounded by $\kappa$.

Then we have that:

$$\frac{\Bigl\lVert \frac{\partial \tilde f(x)}{\partial \tilde h_l} - \frac{\partial f(x)}{\partial h_l} \Bigr\rVert_F}{\Bigl\lVert \frac{\partial f(x)}{\partial h_l} \Bigr\rVert_F} \;\leq\; C_1\left[\,\prod_{k=1}^{L}\left(1 + C_2\,\frac{\lVert \Delta W_k \rVert_F}{\lVert W_k \rVert_F}\right) - 1\right],$$

where $\tilde h_l$ denotes the $l$th hidden layer of the perturbed network, and where we have defined constants $C_1$ and $C_2$ depending only on the transmission constants $a$ and $b$, the condition number bound $\kappa$ and the maximum width $d$.

Notice that the assumptions are a subset of those made in Theorem 1. The proof is given in the appendix.

Let us inspect the result itself. Up to the inclusion of the constant $C_1$, we see that deep relative trust appears on the right-hand side of Theorem 2 just as it did for Theorem 1.

5 Descent under relative trust

Up until this point in the paper, we have introduced the concept of deep relative trust and shown theoretically how it connects to both the relative functional difference and relative gradient difference for a broad class of neural networks. What significance does this have for optimisation?

The most striking prediction of the theory is that for large depth $L$, a neural network diverges quasi-exponentially in the relative size of the parameter perturbation. To see this, we compare deep relative trust to the product form of the exponential function: for perturbations of relative size $c$ in every layer,

$$\prod_{l=1}^{L}(1 + c) - 1 \;=\; (1 + c)^L - 1 \;\approx\; e^{cL} - 1 \qquad \text{for small } c.$$

We visualise this prediction in Figure 1. We test it by comparing the loss and gradient along parameter slices for a 2-layer and a 16-layer multilayer perceptron. The results are given in Figure 2 and seem to support the idea of a catastrophic breakdown in trust.

The time has come to derive algorithms. We wish to solve:

$$\min_{\Delta W_1, \dots, \Delta W_L}\; \sum_{l=1}^{L} \big\langle \nabla_{W_l}\mathcal{L},\, \Delta W_l \big\rangle \;+\; \lambda\left[\,\prod_{l=1}^{L}\left(1 + \frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F}\right) - 1\right]. \qquad (3)$$

Solving (3) exactly is challenging because of the coupling across layers. Whilst one can imagine various approximation schemes such as a mean-field theory in depth, a solution via perturbation series or even a numerical solution, we prefer to keep matters simple in this work.

Figure 3: Comparing deep relative trust (deep pink) and our surrogate (orange). The comparison is made for perturbations of fixed relative size $c$ in every layer $l$. The surrogate becomes increasingly accurate for large $c$.

A decoupled surrogate

We introduce a surrogate to deep relative trust to decouple the effect of perturbations across layers for tractability.

Definition 4 (Surrogate to deep relative trust).

$$\big(\Delta W_1, \dots, \Delta W_L\big) \;\mapsto\; \frac{1}{L}\sum_{l=1}^{L}\left(\frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F}\right)^{L}.$$

To understand the use of this surrogate, observe first that it depends on the relative size of the perturbations; second, it is a polynomial of the same order as deep relative trust; and third, for large perturbations of constant relative size across layers, the two concepts of trust are the same. To see this, consider perturbations of relative size $c$, meaning that $\lVert \Delta W_l \rVert_F / \lVert W_l \rVert_F = c$ for all layers $l$. Then as $c \to \infty$:

$$\prod_{l=1}^{L}(1 + c) - 1 \;=\; (1 + c)^L - 1 \;\sim\; c^L \;=\; \frac{1}{L}\sum_{l=1}^{L} c^L.$$
We compare deep relative trust and its surrogate in Figure 3. The comparison is for a 20 layer network assuming a fixed perturbation size across layers.

Let us now replace the deep relative trust penalty in (3) by its surrogate. We define $G_l := \nabla_{W_l}\mathcal{L}$ and obtain the following optimisation problem:

$$\min_{\Delta W_1, \dots, \Delta W_L}\; \sum_{l=1}^{L} \big\langle G_l,\, \Delta W_l \big\rangle \;+\; \frac{\lambda}{L}\sum_{l=1}^{L}\left(\frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F}\right)^{L}.$$

Notice that the optimisation problem conveniently decouples over layers. For each layer $l$, we have:

$$\min_{\Delta W_l}\; \big\langle G_l,\, \Delta W_l \big\rangle \;+\; \frac{\lambda}{L}\left(\frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F}\right)^{L}.$$

For the $l$th layer, it is clear that the minimiser is of the form $\Delta W_l = -\rho\, G_l$ for some $\rho \geq 0$, since the gradient is the only direction in the problem, and $+\rho$ would be inappropriate. We substitute this in and minimise over $\rho$ to obtain:

$$\Delta W_l \;=\; -\left(\frac{\lVert W_l \rVert_F^{L}}{\lambda\,\lVert G_l \rVert_F^{L-2}}\right)^{\frac{1}{L-1}} G_l.$$
Figure 4: Training multilayer perceptrons at depths challenging for existing optimisers. We train multilayer perceptrons of varying depth on the MNIST dataset. At each depth, we plot the training accuracy after 100 epochs. For each algorithm, we plot the best performing run over 3 learning rate settings found to be appropriate for that algorithm. We also plot trend lines to help guide the eye.

A natural way to obtain a depth-independent algorithm is to let the depth $L \to \infty$. We adopt a scaling of $\lambda$ with depth so that a learning rate $\eta$ is kept in the limit. We arrive at:

$$\Delta W_l \;=\; -\,\eta\; \lVert W_l \rVert_F\, \frac{\nabla_{W_l}\mathcal{L}}{\lVert \nabla_{W_l}\mathcal{L} \rVert_F}. \qquad (4)$$

We see that our theoretical arguments have recovered a special form of “gradient clipping”. You et al. (2017) proposed a similar update rule based on empirical observations. Unfortunately, there is still an issue with this update rule: the update tends to increase weight norms. To see this, consider an update that is orthogonal to the matrix $W_l$ in the Frobenius inner product. Then, by (4), the norm of the updated weights is given by:

$$\lVert W_l + \Delta W_l \rVert_F \;=\; \sqrt{\lVert W_l \rVert_F^2 + \eta^2\,\lVert W_l \rVert_F^2} \;=\; \lVert W_l \rVert_F\,\sqrt{1 + \eta^2}.$$

This is just Pythagoras' theorem, as visualised in the inset figure. We see that the Frobenius norm of the parameters tends to grow by a factor $\sqrt{1 + \eta^2}$.
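A quick numerical check of this norm-growth effect, and of the correction used in Algorithm 1 (an illustrative snippet, not from the paper):

```python
import math
import torch

eta = 0.01
W = torch.randn(128, 64)
g = torch.randn(128, 64)
# Project the gradient to be orthogonal to W in the Frobenius inner product.
g = g - (g * W).sum() / (W * W).sum() * W

update = -eta * W.norm() * g / g.norm()     # the rescaled step of equation (4)
print((W + update).norm() / W.norm())       # ≈ sqrt(1 + eta^2): the norm grows
print(((W + update) / math.sqrt(1 + eta ** 2)).norm() / W.norm())   # ≈ 1 after correction
```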

This effect can be serious when the model class is invariant to the parameter scale as is the case for common weight normalisation schemes (Ioffe and Szegedy, 2015; Miyato et al., 2018). Under these schemes, the loss function provides no incentive to control the parameter scale and the norm will grow without bound.

We present the appropriately corrected version of the algorithm in Algorithm 1 on page 1. We call it Fromage, short for Frobenius matched gradient descent.

Figure 5: Training a class-conditional generative adversarial network on the CIFAR-10 dataset (Krizhevsky, 2009). Top: we plot the norms across layers during training. Fromage stabilises the norms whereas for Adam they wander—which can be a serious issue (Brock et al., 2019, Figure 27). Bottom: we plot the mean and shade the range of the FID score (Heusel et al., 2017) during training. We attain a state of the art FID score just by switching the optimiser to Fromage.

A guide to choosing hyperparameters

One of the attractive features of Algorithm 1 is that there is only one hyperparameter and its meaning is obvious. Neglecting the second-order correction, we have that for every layer $l$, the algorithm's update satisfies:

$$\frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F} \;=\; \eta. \qquad (5)$$

In words: the algorithm induces a relative change of $\eta$ in each layer of the neural network. If we set $\eta = 0.01$, then the weight matrices are allowed to change by 1% per iteration. In practice, we find this value to be a good default.

The contrast to SGD and Adam is stark. For these algorithms, the learning rate has little intrinsic meaning, and the effective perturbation strength depends on a complicated interplay between four factors: initial weight scale, weight decay hyperparameter, weight growth during training and the user-prescribed learning rate hyperparameter.

We may say more about Fromage by appealing to Mirsky’s theorem—a basic result in matrix perturbation theory.

Theorem 3 (Mirsky (1960)).

Let $A$ and $\widetilde A$ be two matrices in $\mathbb{R}^{m \times n}$ and set $r := \min(m, n)$. Let $\sigma_1 \geq \dots \geq \sigma_r$ and $\widetilde\sigma_1 \geq \dots \geq \widetilde\sigma_r$ respectively denote their ordered singular values. Then we have that:

$$\sqrt{\sum_{i=1}^{r}(\widetilde\sigma_i - \sigma_i)^2} \;\leq\; \lVert \widetilde A - A \rVert_F.$$

We apply this result to the $l$th network layer. Let $\sigma_1 \geq \dots \geq \sigma_r$ denote the singular values of $W_l$ and $\widetilde\sigma_1 \geq \dots \geq \widetilde\sigma_r$ denote the singular values of $W_l + \Delta W_l$. Then dividing Theorem 3 through by the root mean square singular value $\sigma_{\mathrm{rms}} := \lVert W_l \rVert_F / \sqrt{r}$, we obtain:

$$\frac{\sqrt{\frac{1}{r}\sum_{i=1}^{r}(\widetilde\sigma_i - \sigma_i)^2}}{\sigma_{\mathrm{rms}}} \;\leq\; \frac{\lVert \Delta W_l \rVert_F}{\lVert W_l \rVert_F} \;=\; \eta,$$

where we have substituted in (5). In words: the learning rate controls a relative notion of spectral shift.
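A quick numerical sanity check of this spectral bound (an illustrative snippet, not from the paper):

```python
import torch

torch.manual_seed(0)
eta = 0.01
W = torch.randn(64, 32)

# A Fromage-style perturbation of relative Frobenius size eta.
G = torch.randn(64, 32)
dW = -eta * W.norm() * G / G.norm()

s, s_new = torch.linalg.svdvals(W), torch.linalg.svdvals(W + dW)
sigma_rms = W.norm() / min(W.shape) ** 0.5

lhs = (s_new - s).pow(2).mean().sqrt() / sigma_rms   # relative spectral shift
print(float(lhs), '<=', eta)                          # Mirsky's bound in action
```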

Figure 6: Training the resnet50 neural network to classify the Imagenet dataset. Top: we compare Fromage to Adam and SGD without weight decay. We plot the mean and shade the range over 3 repeats. Fromage attains the best test accuracy. Bottom: this time we tune a weight decay parameter for SGD only. With this extra tuning, SGD recovers the test set accuracy of Fromage. Requiring less hyperparameter tuning is desirable for an algorithm.

Spectral instabilities were found by Brock et al. (2019) in the context of large-scale generative adversarial network training with the Adam algorithm. Fromage’s natural ability to control spectral shift therefore seems desirable.

6 Empirical study

Detailed instructions to reproduce these experiments are here: https://github.com/jxbz/fromage.

Figure 7: We fine-tune a bert-base transformer on the SQuAD1.0 dataset. We tune the learning rate for each algorithm with other hyperparameter settings as default from Wolf et al. (2019). We plot the mean and shade the range over 3 repeats. Fromage marginally outperforms the baselines in terms of F1 score.

To test the main prediction of our theory—that the function and gradient of a deep network break down quasi-exponentially in the size of the perturbation—we directly study the behaviour of a multilayer perceptron trained on the MNIST dataset (Lecun et al., 1998) under parameter perturbations. Perturbing along the gradient direction, we find that the change in gradient and objective function is indeed quasi-exponential for a deep network (see Figure 2).

The theory also predicts that the geometry of trust for a deep network becomes increasingly pathological as the network gets deeper, and Fromage is specifically designed to account for this. In Figure 4, we find that Adam and SGD are unable to train multilayer perceptrons more than 25 layers deep, whereas Fromage is able to train up to at least depth 50.

To test the predictions about the Frobenius norm stability of Fromage, we train a class-conditional generative adversarial network (Miyato et al., 2018) on the CIFAR-10 dataset (Krizhevsky, 2009). We find (Figure 5) that Fromage almost perfectly stabilises the Frobenius norms, whereas when training with Adam the norms wander significantly.

Next, we benchmark Fromage on three canonical deep learning tasks: generative adversarial image generation, image classification and natural language processing.

We find that Fromage outperforms Adam for training a class-conditional generative adversarial network on the CIFAR-10 dataset. The results are given in Figure 5. Next, when training a resnet50 network to classify the Imagenet dataset (Deng et al., 2009), Fromage outperforms SGD without weight decay and matches SGD with weight decay (Figure 6), meaning that Fromage requires less tuning in this setting. Finally, when fine-tuning a transformer on SQuAD1.0 (Rajpurkar et al., 2016), Fromage marginally outperforms Adam and SGD in evaluation score (Figure 7).

7 Limitations and future work

Figure 8: Using Fromage to train a resnet18 for CIFAR-10 classification. We plot the mean and shade the range over 3 repeats, at various training batch sizes. Top: the training accuracy is roughly insensitive to batch size. Bottom: pathologies exist in the loss at small batch size, as a small subset of batches acquire large loss.

It is common practice in deep learning to randomly subsample data to evaluate the gradient. Our theory is limited in that it neglects this stochasticity entirely. In one of our experiments (Figure 8) we witnessed an instability in Fromage at small batch size. Whilst we found that introducing a form of momentum fixed the problem, future work could investigate the theory of stochastic Fromage more thoroughly.

Our theory is also limited in that it only applies to the multilayer perceptron—the model organism of deep learning theory. Neural networks found in the wild depart from this basic structure in several key ways. Residual connections (He et al., 2016) and batch normalisation (Ioffe and Szegedy, 2015) have been found to stabilise deep network training in numerous applications. Using our tools to analyse these techniques could be a fruitful direction in which to head.

8 Conclusion

We have written down a distance on deep neural networks and studied the implications of this distance for optimisation. We are optimistic that deep relative trust may also help in studying convergence and generalisation in deep learning.

Indeed, recent work (Wilson et al., 2017; Azizan et al., 2019) has studied the relationship between the optimisation algorithm and generalisation. Since we found that Fromage tended to generalise well in our experiments, we are curious to see how it fits into this picture.

Acknowledgements

The authors would like to thank Dillon Huff, Jeffrey Pennington and Florian Schaefer for useful conversations. They made heavy use of a codebase built by Jiahui Yu. They are much obliged to Sivakumar Arayandi Thottakara, Jan Kautz, Sabu Nadarajan and Nithya Natesan for infrastructure support. JB is supported by an NVIDIA fellowship.

References

  • Z. Allen-Zhu (2018) Natasha 2: faster non-convex optimization than SGD. In Neural Information Processing Systems, Cited by: §2.
  • S. Amari (2016) Information geometry and its applications. Springer. Cited by: §2.
  • C. Anil, J. Lucas, and R. Grosse (2019) Sorting out Lipschitz function approximation. In International Conference on Machine Learning, Cited by: §2.
  • N. Azizan and B. Hassibi (2019) Stochastic gradient/mirror descent: minimax optimality and implicit regularization. In International Conference on Learning Representations, Cited by: §2.
  • N. Azizan, S. Lale, and B. Hassibi (2019) Stochastic mirror descent on overparameterized nonlinear models: convergence, implicit regularization, and generalization. arXiv:1906.03830. Cited by: §2, §8.
  • A. Benjamin, D. Rolnick, and K. Kording (2019) Measuring and regularizing networks in function space. In International Conference on Learning Representations, Cited by: §2.
  • J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018) SignSGD: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, Cited by: §2.
  • L. Bottou, F. E. Curtis, and J. Nocedal (2016) Optimization methods for large-scale machine learning. SIAM Review. Cited by: §2.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §2, Figure 5, §5.
  • P. Bürgisser and F. Cucker (2010) Smoothed analysis of Moore-Penrose inversion. SIAM Journal on Matrix Analysis and Applications. Cited by: §3.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, Cited by: §6.
  • S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos (2017) Gradient descent can take exponential time to escape saddle points. In Neural Information Processing Systems, Cited by: §2.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, Cited by: §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Neural Information Processing Systems, Cited by: §2.
  • M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, Cited by: §7.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Neural Information Processing Systems, Cited by: Figure 5.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, Cited by: §5, §7.
  • D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, Cited by: §2.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: Figure 5, §6.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §6.
  • J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht (2016) Gradient descent only converges to minimizers. In Conference on Learning Theory, Cited by: §2.
  • J. Martens and R. Grosse (2015) Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, Cited by: §2.
  • L. Mirsky (1960) Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics. Cited by: Theorem 3.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, Cited by: §2, §5, §6.
  • A. S. Nemirovsky and D. B. Yudin (1983) Problem complexity and method efficiency in optimization. Wiley. Cited by: §2.
  • Y. Nesterov (2014) Introductory lectures on convex optimization: a basic course. Springer. Cited by: §2.
  • J. Nocedal and S. Wright (2006) Numerical optimization. Springer. Cited by: §1.
  • R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, Cited by: §2.
  • J. Pennington and Y. Bahri (2017) Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning, Cited by: §2.
  • J. Pennington, S. Schoenholz, and S. Ganguli (2017) Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Neural Information Processing Systems, Cited by: §2, §2.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing, Cited by: §6.
  • A. Sankar, D. A. Spielman, and S. Teng (2006) Smoothed analysis of the condition numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications. Cited by: §3.
  • A. M. Saxe, J. L. McClelland, and S. Ganguli (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, Cited by: §2.
  • F. Schaefer and A. Anandkumar (2019) Competitive gradient descent. In Neural Information Processing Systems, Cited by: §2.
  • R. Sun (2019) Optimization for deep learning: theory and algorithms. arXiv:1912.08957. Cited by: §2.
  • A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, Cited by: §8.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv:1910.03771. Cited by: Figure 7.
  • L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. Schoenholz, and J. Pennington (2018) Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, Cited by: §2.
  • G. Yang and S. Schoenholz (2017) Mean field residual networks: on the edge of chaos. In Neural Information Processing Systems, Cited by: §2.
  • Y. You, I. Gitman, and B. Ginsburg (2017) Scaling SGD batch size to 32K for Imagenet training. Technical report Technical Report UCB/EECS-2017-156. Cited by: §5.

Appendix

We begin by fleshing out the analysis of the two-layer scalar network, since this example already goes a long way to exposing the relevant mathematical structure.

Consider $f : \mathbb{R} \to \mathbb{R}$ defined by $f(x) := a\,b\,x$ for non-zero scalars $a$ and $b$. Also consider the perturbed function $\tilde f(x) := (a + \Delta a)(b + \Delta b)\,x$, where $\Delta a$ and $\Delta b$ are perturbations. The relative difference obeys:

$$\frac{|\tilde f(x) - f(x)|}{|f(x)|} \;=\; \left|\frac{\Delta a}{a} + \frac{\Delta b}{b} + \frac{\Delta a\,\Delta b}{a\,b}\right|.$$

We already see the presence of strong interactions between the two layers. But let us simplify the expression by using Young's inequality on the cross-terms. We obtain:

$$\frac{|\tilde f(x) - f(x)|}{|f(x)|} \;\leq\; \left(1 + \frac{|\Delta a|}{|a|}\right)\left(1 + \frac{|\Delta b|}{|b|}\right) - 1.$$

Our two main theorems generalise this argument to far more involved cases. We begin with Theorem 1.

To aid in the proof of this result, we shall first state and prove two useful lemmas.

Lemma 1 (Matrix-vector conditioning).

Let $A$ be a matrix in $\mathbb{R}^{m \times n}$ with singular values $\sigma_1 \geq \dots \geq \sigma_r$, where $r := \min(m, n)$. Assume that $A$ has bounded condition number $\sigma_1 / \sigma_r \leq \kappa$. Then for all $x \in \mathbb{R}^n$,

$$\frac{\lVert A \rVert_F}{\kappa\sqrt{r}}\,\lVert x \rVert_2 \;\leq\; \lVert A x \rVert_2 \;\leq\; \lVert A \rVert_F\,\lVert x \rVert_2.$$

Proof.

Observe that

$$\sigma_r\,\lVert x \rVert_2 \;\leq\; \lVert A x \rVert_2 \;\leq\; \sigma_1\,\lVert x \rVert_2.$$

Since $\lVert A \rVert_F^2 = \sum_{i=1}^{r}\sigma_i^2$, we have that $\sigma_1 \leq \lVert A \rVert_F$ and $\sigma_1 \geq \lVert A \rVert_F / \sqrt{r}$, from which the result follows using $\sigma_r \geq \sigma_1 / \kappa$. ∎

Lemma 2 (Relative magnitude).

Under the same conditions as Theorem 1, we have that for the $l$th hidden layer $h_l$:

Proof.

First observe that a trivial consequence of the first two assumptions is that $a\,|x| \leq |\phi(x)| \leq b\,|x|$ for any $x$. Now recall that we have defined the maximum width of the network as $d := \max_l d_l$. Then we may relax Lemma 1 to:

$$\frac{\lVert W_l \rVert_F}{\kappa\sqrt{d}}\,\lVert x \rVert_2 \;\leq\; \lVert W_l\, x \rVert_2 \;\leq\; \lVert W_l \rVert_F\,\lVert x \rVert_2.$$

This fact will prove its worth in the following argument, which bounds one layer at a time by first applying the assumption on $\phi$ and then Lemma 1. The lemma follows from an obvious induction on depth. ∎

With these tools in hand, let us proceed to Theorem 1.

Proof of Theorem 1.

Again observe that a trivial consequence of the first two assumptions is that $a\,|x| \leq |\phi(x)| \leq b\,|x|$ for any $x$.

To make an inductive argument, we shall assume that the result holds for a network with $L - 1$ layers. Extending to depth $L$, we have:

We may bound the first term by Lemma 2, and the second term by the inductive hypothesis. Then we obtain: