
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence for natural gradient descent on nonlinear neural networks with the squarederror loss. We identify two conditions which guarantee the efficient convergence from random initializations: (1) the Jacobian matrix (of network's output for all training cases with respect to the parameters) is full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For twolayer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that KFAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.
05/27/2019 ∙ by Guodong Zhang, et al. ∙ 35 ∙ shareread it

Adversarial Robustness through Local Linearization
Adversarial training is an effective methodology for training deep neural networks that are robust against adversarial, normbounded perturbations. However, the computational cost of adversarial training grows prohibitively as the size of the model and number of input dimensions increase. Further, training against less expensive and therefore weaker adversaries produces models that are robust against weak attacks but break down under attacks that are stronger. This is often attributed to the phenomenon of gradient obfuscation; such models have a highly nonlinear loss surface in the vicinity of training examples, making it hard for gradientbased attacks to succeed even though adversarial examples still exist. In this work, we introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. We show via extensive experiments on CIFAR10 and ImageNet, that models trained with our regularizer avoid gradient obfuscation and can be trained significantly faster than adversarial training. Using this regularizer, we exceed current state of the art and achieve 47 ImageNet with linfinity adversarial perturbations of radius 4/255 under an untargeted, strong, whitebox attack. Additionally, we match state of the art results for CIFAR10 at 8/255.
07/04/2019 ∙ by Chongli Qin, et al. ∙ 3 ∙ shareread it

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and KFAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.
07/09/2019 ∙ by Guodong Zhang, et al. ∙ 3 ∙ shareread it

A Kroneckerfactored approximate Fisher matrix for convolution layers
Secondorder optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to compute for large models, and most approximations either require an expensive iterative procedure or make crude approximations to the curvature. We present Kronecker Factors for Convolution (KFC), a tractable approximation to the Fisher matrix for convolutional networks based on a structured probabilistic model for the distribution over backpropagated derivatives. Similarly to the recently proposed KroneckerFactored Approximate Curvature (KFAC), each block of the approximate Fisher matrix decomposes as the Kronecker product of small matrices, allowing for efficient inversion. KFC captures important curvature information while still yielding comparably efficient updates to stochastic gradient descent (SGD). We show that the updates are invariant to commonly used reparameterizations, such as centering of the activations. In our experiments, approximate natural gradient descent with KFC was able to train convolutional networks several times faster than carefully tuned SGD. Furthermore, it was able to train the networks in 1020 times fewer iterations than SGD, suggesting its potential applicability in a distributed setting.
02/03/2016 ∙ by Roger Grosse, et al. ∙ 0 ∙ shareread it

Adding Gradient Noise Improves Learning for Very Deep Networks
Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long shortterm memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than more basic architectures. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we discuss a lowoverhead and easytoimplement technique of adding gradient noise which we find to be surprisingly effective when training these very deep architectures. The technique not only helps to avoid overfitting, but also can result in lower training loss. This method alone allows a fullyconnected 20layer deep network to be trained with standard gradient descent, even starting from a poor initialization. We see consistent improvements for many complex models, including a 72 baseline on a challenging questionanswering task, and a doubling of the number of accurate binary multiplication models learned across 7,000 random restarts. We encourage further application of this technique to additional complex modern architectures.
11/21/2015 ∙ by Arvind Neelakantan, et al. ∙ 0 ∙ shareread it

Optimizing Neural Networks with Kroneckerfactored Approximate Curvature
We propose an efficient method for approximating natural gradient descent in neural networks which we call KroneckerFactored Approximate Curvature (KFAC). KFAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor lowrank, and in some cases is completely nonsparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by KFAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate naturalgradient/Newton methods which use highquality nondiagonal curvature matrices (such as Hessianfree optimization), KFAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting KFAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or lowrank approximations to the curvature matrix.
03/19/2015 ∙ by James Martens, et al. ∙ 0 ∙ shareread it

New insights and perspectives on the natural gradient method
Natural gradient descent is an optimization method traditionally motivated from the perspective of information geometry, and works well for many applications as an alternative to stochastic gradient descent. In this paper we critically analyze this method and its properties, and show how it can be viewed as a type of approximate 2ndorder optimization method, where the Fisher information matrix can be viewed as an approximation of the Hessian. This perspective turns out to have significant implications for how to design a practical and robust version of the method. Additionally, we make the following contributions to the understanding of natural gradient and 2ndorder methods: a thorough analysis of the convergence speed of stochastic natural gradient descent (and more general stochastic 2ndorder methods) as applied to convex quadratics, a critical examination of the oftused "empirical" approximation of the Fisher matrix, and an analysis of the (approximate) parameterization invariance property possessed by natural gradient methods, which we show still holds for certain choices of the curvature matrix other than the Fisher, but notably not the Hessian.
12/03/2014 ∙ by James Martens, et al. ∙ 0 ∙ shareread it

On the Expressive Efficiency of Sum Product Networks
Sum Product Networks (SPNs) are a recently developed class of deep generative models which compute their associated unnormalized density functions using a special type of arithmetic circuit. When certain sufficient conditions, called the decomposability and completeness conditions (or "D&C" conditions), are imposed on the structure of these circuits, marginal densities and other useful quantities, which are typically intractable for other deep generative models, can be computed by what amounts to a single evaluation of the network (which is a property known as "validity"). However, the effect that the D&C conditions have on the capabilities of D&C SPNs is not well understood. In this work we analyze the D&C conditions, expose the various connections that D&C SPNs have with multilinear arithmetic circuits, and consider the question of how well they can capture various distributions as a function of their size and depth. Among our various contributions is a result which establishes the existence of a relatively simple distribution with fully tractable marginal densities which cannot be efficiently captured by D&C SPNs of any depth, but which can be efficiently captured by various other deep generative models. We also show that with each additional layer of depth permitted, the set of distributions which can be efficiently captured by D&C SPNs grows in size. This kind of "depth hierarchy" property has been widely conjectured to hold for various deep models, but has never been proven for any of them. Some of our other contributions include a new characterization of the D&C conditions as sufficient and necessary ones for a slightly strengthened notion of validity, and various statemachine characterizations of the types of computations that can be performed efficiently by D&C SPNs.
11/27/2014 ∙ by James Martens, et al. ∙ 0 ∙ shareread it

Estimating the Hessian by Backpropagating Curvature
In this work we develop Curvature Propagation (CP), a general technique for efficiently computing unbiased approximations of the Hessian of any function that is computed using a computational graph. At the cost of roughly two gradient evaluations, CP can give a rank1 approximation of the whole Hessian, and can be repeatedly applied to give increasingly precise unbiased estimates of any or all of the entries of the Hessian. Of particular interest is the diagonal of the Hessian, for which no general approach is known to exist that is both efficient and accurate. We show in experiments that CP turns out to work well in practice, giving very accurate estimates of the Hessian of neural networks, for example, with a relatively small amount of work. We also apply CP to Score Matching, where a diagonal of a Hessian plays an integral role in the Score Matching objective, and where it is usually computed exactly using inefficient algorithms which do not scale to larger and more complex models.
06/27/2012 ∙ by James Martens, et al. ∙ 0 ∙ shareread it

The Mechanics of nPlayer Differentiable Games
The cornerstone underpinning deep learning is the guarantee that gradient descent on an objective converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, where there are multiple interacting losses. The behavior of gradientbased methods in games is not well understood  and is becoming increasingly important as adversarial and multiobjective architectures proliferate. In this paper, we develop new techniques to understand and control the dynamics in general games. The key result is to decompose the secondorder dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games. Basic experiments show SGA is competitive with recently proposed algorithms for finding local Nash equilibria in GANs  whilst at the same time being applicable to  and having guarantees in  much more general games.
02/15/2018 ∙ by David Balduzzi, et al. ∙ 0 ∙ shareread it

On the Variance of Unbiased Online Recurrent Optimization
The recently proposed Unbiased Online Recurrent Optimization algorithm (UORO, arXiv:1702.05043) uses an unbiased approximation of RTRL to achieve fully online gradientbased learning in RNNs. In this work we analyze the variance of the gradient estimate computed by UORO, and propose several possible changes to the method which reduce this variance both in theory and practice. We also contribute significantly to the theoretical and intuitive understanding of UORO (and its existing variance reduction technique), and demonstrate a fundamental connection between its gradient estimate and the one that would be computed by REINFORCE if small amounts of noise were added to the RNN's hidden units.
02/06/2019 ∙ by Tim Cooijmans, et al. ∙ 0 ∙ shareread it
James Martens
is this you? claim profile