
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with the squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's outputs for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable under small perturbations around the initialization. For two-layer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do in fact hold throughout training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.
05/27/2019 ∙ by Guodong Zhang, et al.
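As an illustrative sketch of the update analyzed above, consider damped natural gradient descent for the squared-error loss, with the Fisher matrix built from the Jacobian of the outputs with respect to the parameters. The linear model below stands in for the paper's two-layer ReLU setting; all constants are made up.

```python
import numpy as np

# Sketch: damped natural gradient descent on a squared-error loss. For this
# linear model the Jacobian J of outputs w.r.t. parameters is the data matrix
# X (full row rank here in the column sense), and F = J^T J / n is the
# Fisher / Gauss-Newton matrix.
rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))             # 20 training cases, 5 parameters
y = X @ rng.normal(size=d)              # realizable targets
theta = np.zeros(d)

for _ in range(3):
    resid = X @ theta - y               # residuals of the squared-error loss
    grad = X.T @ resid / n              # gradient of 0.5 * mean squared error
    F = X.T @ X / n + 1e-8 * np.eye(d)  # damped Fisher matrix
    theta -= np.linalg.solve(F, grad)   # natural gradient step (step size 1)

loss = 0.5 * np.mean((X @ theta - y) ** 2)
assert loss < 1e-10                     # converges in very few steps
```

On a quadratic objective like this one, the damped natural gradient step is essentially a Newton step, which is why convergence is nearly immediate; the paper's contribution is showing fast convergence persists for nonlinear overparameterized networks.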

Functional Variational Bayesian Neural Networks
Variational Bayesian neural networks (BNNs) perform variational inference over weights, but it is difficult to specify meaningful priors and approximate posteriors in a high-dimensional weight space. We introduce functional variational Bayesian neural networks (fBNNs), which maximize an Evidence Lower Bound (ELBO) defined directly on stochastic processes, i.e., distributions over functions. We prove that the KL divergence between stochastic processes equals the supremum of marginal KL divergences over all finite sets of inputs. Based on this, we introduce a practical training objective which approximates the functional ELBO using finite measurement sets and the spectral Stein gradient estimator. With fBNNs, we can specify priors entailing rich structures, including Gaussian processes and implicit stochastic processes. Empirically, we find fBNNs extrapolate well using various structured priors, provide reliable uncertainty estimates, and scale to large datasets.
03/14/2019 ∙ by Shengyang Sun, et al.
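The supremum property above can be checked numerically in a toy case: for two zero-mean Gaussian processes (here RBF kernels with made-up lengthscales), the marginal KL divergence over a nested sequence of finite measurement sets is nondecreasing, consistent with the functional KL being a supremum over finite input sets.

```python
import numpy as np

# Toy check: marginal KL between two GP priors grows (weakly) as the finite
# measurement set is enlarged. Kernels and lengthscales are illustrative.
def rbf(x, ls):
    return np.exp(-0.5 * ((x[:, None] - x[None, :]) / ls) ** 2)

def gaussian_kl(K0, K1):
    # KL( N(0, K0) || N(0, K1) )
    k = K0.shape[0]
    _, logdet0 = np.linalg.slogdet(K0)
    _, logdet1 = np.linalg.slogdet(K1)
    return 0.5 * (np.trace(np.linalg.solve(K1, K0)) - k + logdet1 - logdet0)

x = np.linspace(-2.0, 2.0, 40)
kls = []
for m in (5, 10, 20, 40):                  # nested measurement sets
    xm, I = x[:m], np.eye(m)
    kls.append(gaussian_kl(rbf(xm, 0.5) + 1e-6 * I,
                           rbf(xm, 1.0) + 1e-6 * I))

assert all(b >= a - 1e-9 for a, b in zip(kls, kls[1:]))  # monotone in set size
```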

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions
Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).
03/07/2019 ∙ by Matthew MacKay, et al.
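The gating intuition can be seen in a simpler stand-in problem: for ridge regression, the exact best-response w*(λ) = (XᵀX + λI)⁻¹Xᵀy is an elementwise, hyperparameter-dependent gate on the eigen-coordinates of XᵀX. This toy example is illustrative, not the paper's exact construction.

```python
import numpy as np

# Sketch: the ridge best-response equals a lambda-conditioned elementwise gate
# in the eigenbasis of X^T X -- a toy analogue of STN-style gating.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 0.3

w_direct = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

s, V = np.linalg.eigh(X.T @ X)          # X^T X = V diag(s) V^T
gate = 1.0 / (s + lam)                  # gate value depends on the hyperparameter
w_gated = V @ (gate * (V.T @ (X.T @ y)))

assert np.allclose(w_direct, w_gated)   # the gated form is the exact best-response
```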

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis
Reducing the test-time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices. To achieve this goal, we introduce a novel network reparameterization based on the Kronecker-factored eigenbasis (KFE), and then apply Hessian-based structured pruning methods in this basis. As opposed to existing Hessian-based pruning algorithms which do pruning in parameter coordinates, our method works in the KFE, where different weights are approximately independent, enabling accurate pruning and fast computation. We demonstrate empirically the effectiveness of the proposed method through extensive experiments. In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. With negligible loss of accuracy, an iterative-pruning version gives a 10× reduction in model size and an 8× reduction in FLOPs on wide ResNet32.
05/15/2019 ∙ by Chaoqi Wang, et al.
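The rotate-score-prune pattern can be sketched as follows, with random stand-in Kronecker factors (this is an illustration of the idea, not the paper's exact algorithm): express the weights in the eigenbases of the input-side factor A and output-side factor G, score entries by an approximate second-order importance, zero the lowest-scoring half, and rotate back.

```python
import numpy as np

# Illustrative pruning in the Kronecker-factored eigenbasis (KFE), with
# made-up factors A (input covariance) and G (gradient covariance).
rng = np.random.default_rng(2)
W = rng.normal(size=(6, 4))
A = np.cov(rng.normal(size=(4, 200)))     # 4x4 input-side Kronecker factor
G = np.cov(rng.normal(size=(6, 200)))     # 6x6 output-side Kronecker factor

lG, UG = np.linalg.eigh(G)
lA, UA = np.linalg.eigh(A)
W_kfe = UG.T @ W @ UA                     # weights expressed in the KFE
score = W_kfe ** 2 * np.outer(lG, lA)     # approx. loss increase if entry removed
mask = score > np.median(score)           # keep the higher-scoring half
W_pruned = UG @ (W_kfe * mask) @ UA.T     # rotate back to parameter coordinates

assert np.count_nonzero(mask) == W.size // 2
```

Because the curvature is approximately diagonal in the KFE, each entry's score can be read off independently, which is what makes pruning in this basis both accurate and cheap.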

Sorting out Lipschitz function approximation
Training neural networks subject to a Lipschitz constraint is useful for generalization bounds, provable adversarial robustness, interpretable gradients, and Wasserstein distance estimation. By the composition property of Lipschitz functions, it suffices to ensure that each individual affine transformation or nonlinear activation function is 1-Lipschitz. The challenge is to do this while maintaining the expressive power. We identify a necessary property for such an architecture: each of the layers must preserve the gradient norm during backpropagation. Based on this, we propose to combine a gradient norm preserving activation function, GroupSort, with norm-constrained weight matrices. We show that norm-constrained GroupSort architectures are universal Lipschitz function approximators. Empirically, we show that norm-constrained GroupSort networks achieve tighter estimates of Wasserstein distance than their ReLU counterparts and can achieve provable adversarial robustness guarantees with little cost to accuracy.
11/13/2018 ∙ by Cem Anil, et al.
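A minimal sketch of the GroupSort activation described above (group size 2, the smallest case): sorting within groups is an input-dependent permutation, so it preserves the norm of both activations and backpropagated gradients.

```python
import numpy as np

def group_sort(x, group_size=2):
    """GroupSort activation: sort entries within each group of units.

    As an (input-dependent) permutation of its input, it preserves the
    gradient norm during backpropagation.
    """
    g = x.reshape(-1, group_size)
    return np.sort(g, axis=1).reshape(x.shape)

x = np.array([3.0, -1.0, 0.5, 2.0])
out = group_sort(x)

assert np.allclose(out, [-1.0, 3.0, 0.5, 2.0])              # sorted within pairs
assert np.allclose(np.linalg.norm(out), np.linalg.norm(x))  # norm preserved
```

With group size 2 this is equivalent to computing (min, max) per pair, which is how GroupSort can represent absolute-value-like nonlinearities that 1-Lipschitz ReLU layers cannot.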

A Coordinate-Free Construction of Scalable Natural Gradient
Most neural networks are trained using first-order optimization methods, which are sensitive to the parameterization of the model. Natural gradient descent is invariant to smooth reparameterizations because it is defined in a coordinate-free way, but tractable approximations are typically defined in terms of coordinate systems, and hence may lose the invariance properties. We analyze the invariance properties of the Kronecker-Factored Approximate Curvature (K-FAC) algorithm by constructing the algorithm in a coordinate-free way. We explicitly construct a Riemannian metric under which the natural gradient matches the K-FAC update; invariance to affine transformations of the activations follows immediately. We extend our framework to analyze the invariance properties of K-FAC applied to convolutional networks and recurrent neural networks, as well as metrics other than the usual Fisher metric.
08/30/2018 ∙ by Kevin Luk, et al.

Reversible Recurrent Neural Networks
Recurrent neural networks (RNNs) provide state-of-the-art performance in processing sequential data but are memory intensive to train, limiting the flexibility of the RNN models that can be trained. Reversible RNNs, for which the hidden-to-hidden transition can be reversed, offer a path to reduce the memory requirements of training, as hidden states need not be stored and instead can be recomputed during backpropagation. We first show that perfectly reversible RNNs, which require no storage of the hidden activations, are fundamentally limited because they cannot forget information from their hidden state. We then provide a scheme for storing a small number of bits in order to allow perfect reversal with forgetting. Our method achieves comparable performance to traditional models while reducing the activation memory cost by a factor of 10–15. We extend our technique to attention-based sequence-to-sequence models, where it maintains performance while reducing the activation memory cost by a factor of 5–10 in the encoder, and a factor of 10–15 in the decoder.
10/25/2018 ∙ by Matthew MacKay, et al.
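The reversibility idea can be sketched with an additive-coupling transition (illustrative; the paper's cells additionally handle forgetting by storing a small number of bits): the hidden state is split into two halves that update each other, so the transition can be inverted exactly and hidden states recomputed during backpropagation.

```python
import numpy as np

# Minimal reversible hidden-to-hidden transition via additive coupling.
def forward(h1, h2, x, W1, W2):
    h2_new = h2 + np.tanh(W2 @ np.concatenate([h1, x]))
    h1_new = h1 + np.tanh(W1 @ np.concatenate([h2_new, x]))
    return h1_new, h2_new

def reverse(h1_new, h2_new, x, W1, W2):
    # Undo the two coupling updates in the opposite order.
    h1 = h1_new - np.tanh(W1 @ np.concatenate([h2_new, x]))
    h2 = h2_new - np.tanh(W2 @ np.concatenate([h1, x]))
    return h1, h2

rng = np.random.default_rng(3)
d = 4
W1, W2 = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, 2 * d))
h1, h2, x = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

f1, f2 = forward(h1, h2, x, W1, W2)
r1, r2 = reverse(f1, f2, x, W1, W2)
assert np.allclose(r1, h1) and np.allclose(r2, h2)  # exact reconstruction
```

Because each half is only ever shifted by a function of the other half (and the input), inversion is a subtraction rather than solving an equation, which is what makes recomputation exact and cheap.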

Adversarial Distillation of Bayesian Neural Network Posteriors
Bayesian neural networks (BNNs) allow us to reason about uncertainty in a principled way. Stochastic Gradient Langevin Dynamics (SGLD) enables efficient BNN learning by drawing samples from the BNN posterior using minibatches. However, SGLD and its extensions require storage of many copies of the model parameters, a potentially prohibitive cost, especially for large neural networks. We propose a framework, Adversarial Posterior Distillation, to distill the SGLD samples using a Generative Adversarial Network (GAN). At test time, samples are generated by the GAN. We show that this distillation framework incurs no loss in performance on recent BNN applications including anomaly detection, active learning, and defense against adversarial attacks. By construction, our framework not only distills the Bayesian predictive distribution, but the posterior itself. This allows one to compute quantities such as the approximate model variance, which is useful in downstream tasks. To our knowledge, these are the first results applying MCMC-based BNNs to the aforementioned downstream applications.
06/27/2018 ∙ by Kuan-Chieh Wang, et al.

Eigenvalue Corrected Noisy Natural Gradient
Variational Bayesian neural networks combine the flexibility of deep learning with Bayesian uncertainty estimation. However, inference procedures for flexible variational posteriors are computationally expensive. A recently proposed method, noisy natural gradient, is a surprisingly simple way to fit expressive posteriors by adding weight noise to regular natural gradient updates. Noisy K-FAC is an instance of noisy natural gradient that fits a matrix-variate Gaussian posterior with minor changes to ordinary K-FAC. Nevertheless, a matrix-variate Gaussian posterior does not capture an accurate diagonal variance. In this work, we extend noisy K-FAC to obtain a more flexible posterior distribution called eigenvalue corrected matrix-variate Gaussian. The proposed method computes the full diagonal rescaling factor in the Kronecker-factored eigenbasis. Empirically, our approach consistently outperforms existing algorithms (e.g., noisy K-FAC) on regression and classification tasks.
11/30/2018 ∙ by Juhan Bae, et al.

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large-scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.
07/09/2019 ∙ by Guodong Zhang, et al.
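The NQM is simple enough to simulate in a few lines. In the sketch below (with made-up constants), SGD runs on a diagonal quadratic 0.5·Σᵢ hᵢxᵢ², observing gradients corrupted by noise whose variance scales as 1/batch_size; tracking E[xᵢ²] analytically shows how larger batches lower the noise floor of the expected loss.

```python
import numpy as np

# Noisy quadratic model sketch: per-dimension second-moment recursion for SGD
# with gradient noise variance sigma2 / batch_size. Constants are illustrative.
def expected_loss(steps, h, lr, batch_size, sigma2=1.0):
    x2 = np.ones_like(h)                 # E[x_i^2] at initialization
    for _ in range(steps):
        # x' = (1 - lr*h) x - lr*noise  =>  second-moment recursion:
        x2 = (1.0 - lr * h) ** 2 * x2 + lr ** 2 * sigma2 / batch_size
    return 0.5 * np.sum(h * x2)

h = np.array([1.0, 0.1, 0.01])           # ill-conditioned curvature spectrum
small_batch = expected_loss(200, h, lr=0.5, batch_size=8)
large_batch = expected_loss(200, h, lr=0.5, batch_size=512)

assert large_batch < small_batch         # larger batch -> lower noise floor
```

In this model, preconditioning amounts to rescaling each dimension by 1/hᵢ, which equalizes the contraction rates and changes how the achievable loss depends on batch size, the mechanism behind the larger critical batch sizes observed for Adam and K-FAC.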

Differentiable Compositional Kernel Learning for Gaussian Processes
The generalization properties of Gaussian processes depend heavily on the choice of kernel, and this choice remains a dark art. We present the Neural Kernel Network (NKN), a flexible family of kernels represented by a neural network. The NKN architecture is based on the composition rules for kernels, so that each unit of the network corresponds to a valid kernel. It can compactly approximate compositional kernel structures such as those used by the Automatic Statistician (Lloyd et al., 2014), but because the architecture is differentiable, it is end-to-end trainable with gradient-based optimization. We show that the NKN is universal for the class of stationary kernels. Empirically, we demonstrate pattern discovery and extrapolation abilities of NKN on several tasks that depend crucially on identifying the underlying structure, including time series and texture extrapolation, as well as Bayesian optimization.
06/12/2018 ∙ by Shengyang Sun, et al.
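The composition rules that make every NKN unit a valid kernel can be checked in a toy case: nonnegative sums and elementwise products of positive-semidefinite kernel matrices are themselves positive-semidefinite (the latter by the Schur product theorem). The kernels and weights below are made up.

```python
import numpy as np

# Toy check: an NKN-style unit (weighted sum of kernels and kernel products)
# yields a positive-semidefinite kernel matrix.
def rbf(x, ls=1.0):
    return np.exp(-0.5 * ((x[:, None] - x[None, :]) / ls) ** 2)

def linear(x):
    return np.outer(x, x)

x = np.linspace(-1.0, 1.0, 8)
# one "unit": nonnegative weighted sum of a kernel and a product of kernels
K = 0.3 * rbf(x, 0.5) + 0.7 * linear(x) * rbf(x, 2.0)

eigs = np.linalg.eigvalsh(K)
assert eigs.min() > -1e-10               # K is (numerically) PSD
```

Since closure holds at every unit, an entire network of such units is guaranteed to output a valid kernel, while remaining differentiable in its weights.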
Roger Grosse
Assistant Professor of Computer Science at the University of Toronto, focusing on machine learning.