
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with the squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's outputs for all training cases with respect to the parameters) is full row rank, and (2) the Jacobian matrix is stable under small perturbations around the initialization. For two-layer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do in fact hold throughout training, under the assumptions of non-degenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.
05/27/2019 ∙ by Guodong Zhang, et al.
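For squared-error loss the Fisher matrix reduces to J^T J, so a natural gradient step is a (damped) Gauss-Newton step. The sketch below illustrates this on a toy two-layer ReLU network in the overparameterized regime the abstract describes; the network sizes, damping, and step size are illustrative choices, not the paper's.

```python
import numpy as np

# Minimal sketch of damped natural gradient descent for squared-error loss.
# For this loss the Fisher matrix is J^T J, where J is the Jacobian of the
# network outputs over all training cases w.r.t. the parameters.

rng = np.random.default_rng(0)
n, d, h = 20, 3, 16                      # overparameterized: 64 params > 20 cases

def two_layer_relu(theta, X):
    W = theta[:h * d].reshape(h, d)      # hidden-layer weights
    a = theta[h * d:]                    # output-layer weights
    return np.maximum(X @ W.T, 0.0) @ a

def jacobian(f, theta, X, eps=1e-6):
    # Finite-difference Jacobian (n x p); adequate for a toy problem.
    base = f(theta, X)
    J = np.zeros((len(base), len(theta)))
    for i in range(len(theta)):
        t = theta.copy(); t[i] += eps
        J[:, i] = (f(t, X) - base) / eps
    return J

X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta = rng.normal(scale=0.5, size=h * d + h)

for _ in range(100):
    r = two_layer_relu(theta, X) - y              # residuals
    J = jacobian(two_layer_relu, theta, X)
    F = J.T @ J + 1e-3 * np.eye(len(theta))       # damped Fisher
    theta -= 0.5 * np.linalg.solve(F, J.T @ r)    # natural gradient step

loss = 0.5 * np.mean((two_layer_relu(theta, X) - y) ** 2)
```

With many more parameters than training cases, the Jacobian is typically full row rank, and the training loss is driven close to zero quickly, matching the abstract's two conditions.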

Functional Variational Bayesian Neural Networks
Variational Bayesian neural networks (BNNs) perform variational inference over weights, but it is difficult to specify meaningful priors and approximate posteriors in a high-dimensional weight space. We introduce functional variational Bayesian neural networks (fBNNs), which maximize an Evidence Lower BOund (ELBO) defined directly on stochastic processes, i.e. distributions over functions. We prove that the KL divergence between stochastic processes equals the supremum of marginal KL divergences over all finite sets of inputs. Based on this, we introduce a practical training objective which approximates the functional ELBO using finite measurement sets and the spectral Stein gradient estimator. With fBNNs, we can specify priors entailing rich structures, including Gaussian processes and implicit stochastic processes. Empirically, we find fBNNs extrapolate well using various structured priors, provide reliable uncertainty estimates, and scale to large datasets.
03/14/2019 ∙ by Shengyang Sun, et al.
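The abstract's central identity and objective can be sketched as follows, in notation of my choosing (P, Q are stochastic processes and P_X, Q_X their finite-dimensional marginals on a measurement set X):

```latex
\mathrm{KL}(P \,\|\, Q)
  \;=\; \sup_{n \in \mathbb{N},\; \mathbf{X} \in \mathcal{X}^{n}}
        \mathrm{KL}\!\left(P_{\mathbf{X}} \,\|\, Q_{\mathbf{X}}\right),
\qquad
\mathcal{L}_{\mathrm{fELBO}}(q)
  \;=\; \mathbb{E}_{q}\!\left[\log p(\mathcal{D} \mid f)\right]
        \;-\; \mathrm{KL}\!\left(q(f) \,\|\, p(f)\right).
```

The practical objective replaces the intractable functional KL on the right of the fELBO with the marginal KL on a finite measurement set, whose gradients are estimated with the spectral Stein gradient estimator.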

Benchmarking Model-Based Reinforcement Learning
Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample-efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-source or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics bottleneck, the planning horizon dilemma, and the early-termination dilemma. Finally, to maximally facilitate future research on MBRL, we open-source our benchmark at http://www.cs.toronto.edu/~tingwuwang/mbrl.html.
07/03/2019 ∙ by Tingwu Wang, et al.

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis
Reducing the test-time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices. To achieve this goal, we introduce a novel network reparameterization based on the Kronecker-factored eigenbasis (KFE), and then apply Hessian-based structured pruning methods in this basis. As opposed to existing Hessian-based pruning algorithms, which prune in parameter coordinates, our method works in the KFE, where different weights are approximately independent, enabling accurate pruning and fast computation. We demonstrate empirically the effectiveness of the proposed method through extensive experiments. In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. With negligible loss of accuracy, an iterative-pruning version gives a 10× reduction in model size and an 8× reduction in FLOPs on wide ResNet-32.
05/15/2019 ∙ by Chaoqi Wang, et al.
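The core mechanics can be sketched with a Kronecker-factored curvature approximation Fisher ≈ A ⊗ G: eigendecompose both factors, rotate the weights into the resulting eigenbasis, score entries with an OBD-style saliency against the (now approximately diagonal) curvature, and prune the least salient. The factors, sizes, and the 50% pruning ratio below are hypothetical stand-ins, not the paper's method in full.

```python
import numpy as np

# Illustrative sketch: Hessian-aware pruning in a Kronecker-factored
# eigenbasis (KFE), assuming Fisher ~= A (x) G with given factors A, G.

rng = np.random.default_rng(0)
din, dout = 6, 4
W = rng.normal(size=(dout, din))
# Toy Kronecker factors (in practice: activation / gradient covariances).
Xa = rng.normal(size=(50, din)); A = Xa.T @ Xa / 50
Xg = rng.normal(size=(50, dout)); G = Xg.T @ Xg / 50

sA, UA = np.linalg.eigh(A)
sG, UG = np.linalg.eigh(G)

W_kfe = UG.T @ W @ UA                   # rotate weights into the KFE
H_diag = np.outer(sG, sA)               # approx. diagonal curvature in the KFE
saliency = 0.5 * H_diag * W_kfe ** 2    # OBD-style importance score

# Zero out the 50% of entries with the lowest saliency, then rotate back.
thresh = np.quantile(saliency, 0.5)
W_pruned = UG @ (W_kfe * (saliency > thresh)) @ UA.T
```

Because the curvature is approximately diagonal in the KFE, pruning decisions there account for weight interactions that a plain parameter-coordinate magnitude criterion would ignore.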

Eigenvalue Corrected Noisy Natural Gradient
Variational Bayesian neural networks combine the flexibility of deep learning with Bayesian uncertainty estimation. However, inference procedures for flexible variational posteriors are computationally expensive. A recently proposed method, noisy natural gradient, is a surprisingly simple way to fit expressive posteriors by adding weight noise to regular natural gradient updates. Noisy K-FAC is an instance of noisy natural gradient that fits a matrix-variate Gaussian posterior with minor changes to ordinary K-FAC. Nevertheless, a matrix-variate Gaussian posterior does not capture an accurate diagonal variance. In this work, we extend noisy K-FAC to obtain a more flexible posterior distribution called the eigenvalue-corrected matrix-variate Gaussian. The proposed method computes the full diagonal rescaling factor in the Kronecker-factored eigenbasis. Empirically, our approach consistently outperforms existing algorithms (e.g., noisy K-FAC) on regression and classification tasks.
11/30/2018 ∙ by Juhan Bae, et al.

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large-scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.
07/09/2019 ∙ by Guodong Zhang, et al.
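The NQM is simple enough to simulate in a few lines. The sketch below (my own minimal version, with an arbitrary eigenvalue spectrum and noise scale) tracks the expected per-coordinate squared parameter under SGD on a diagonal quadratic, where gradient noise variance scales as 1/batch size; it reproduces the qualitative diminishing-returns effect the abstract describes.

```python
import numpy as np

# Minimal noisy quadratic model (NQM): loss(theta) = 0.5 * sum_i h_i theta_i^2,
# with stochastic gradients g_i = h_i theta_i + eps_i, Var(eps_i) ~ 1/batch.

def expected_loss(h, theta0_sq, lr, batch, steps, noise=1.0):
    # Closed-form per-coordinate recursion for E[theta_i^2] under SGD:
    # E[theta^2] <- (1 - lr*h)^2 * E[theta^2] + lr^2 * noise / batch
    e = theta0_sq.copy()
    for _ in range(steps):
        e = (1 - lr * h) ** 2 * e + lr ** 2 * noise / batch
    return 0.5 * np.sum(h * e)

h = np.logspace(-2, 0, 20)          # ill-conditioned eigenvalue spectrum
theta0_sq = np.ones_like(h)
losses = {B: expected_loss(h, theta0_sq, lr=0.5, batch=B, steps=200)
          for B in (1, 16, 256, 4096)}
# Going from batch 1 to 16 helps a lot; past a critical size the loss
# approaches the noiseless limit and further increases buy little.
```

Preconditioning can be modeled in the same recursion by replacing the scalar learning rate with a per-coordinate one, which is one way the NQM predicts larger critical batch sizes for Adam and K-FAC.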

Differentiable Compositional Kernel Learning for Gaussian Processes
The generalization properties of Gaussian processes depend heavily on the choice of kernel, and this choice remains a dark art. We present the Neural Kernel Network (NKN), a flexible family of kernels represented by a neural network. The NKN architecture is based on the composition rules for kernels, so that each unit of the network corresponds to a valid kernel. It can compactly approximate compositional kernel structures such as those used by the Automatic Statistician (Lloyd et al., 2014), but because the architecture is differentiable, it is end-to-end trainable with gradient-based optimization. We show that the NKN is universal for the class of stationary kernels. Empirically, we demonstrate pattern discovery and extrapolation abilities of NKN on several tasks that depend crucially on identifying the underlying structure, including time series and texture extrapolation, as well as Bayesian optimization.
06/12/2018 ∙ by Shengyang Sun, et al.
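The composition rules the NKN builds on are that nonnegative sums and elementwise products of valid kernels are again valid kernels. A toy sketch (my own primitives and fixed weights, not the trainable NKN architecture) composes an RBF and a periodic kernel and checks that the resulting Gram matrix stays positive semidefinite:

```python
import numpy as np

# Toy sketch of the kernel closure rules behind the NKN: nonnegative sums
# ("linear" units) and Hadamard products ("product" units) of valid
# kernels are again valid kernels.

def rbf(X, Y, ell=1.0):
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

def periodic(X, Y, p=1.0, ell=1.0):
    d = np.abs(X[:, None] - Y[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell ** 2)

def toy_nkn(X, Y, w=(0.6, 0.4)):
    k1, k2 = rbf(X, Y), periodic(X, Y)
    linear = w[0] * k1 + w[1] * k2    # nonnegative combination is a kernel
    return linear * k1                # Schur product is a kernel too

X = np.linspace(-2, 2, 15)
K = toy_nkn(X, X)
eigs = np.linalg.eigvalsh((K + K.T) / 2)
# A valid kernel yields a PSD Gram matrix: eigenvalues >= 0 (up to rounding).
```

In the actual NKN the combination weights are constrained-nonnegative network parameters learned end-to-end, which is what the differentiability claim refers to.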

Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of L_2 regularization. Literal weight decay has been shown to outperform L_2 regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
10/29/2018 ∙ by Guodong Zhang, et al.
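The distinction between literal weight decay and L_2 regularization shows up whenever the optimizer preconditions its gradients. A minimal sketch (a single bias-correction-free Adam step with made-up numbers) contrasts the two: folding the decay into the gradient routes it through Adam's per-coordinate rescaling, while decoupled decay shrinks the weights directly.

```python
import numpy as np

# Sketch contrasting L2 regularization with literal (decoupled) weight
# decay for one Adam-style step; the two coincide for plain SGD but not
# for preconditioned optimizers. All values here are illustrative.

def adam_step(w, g, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = (1 - b1) * g                  # first moment (from zero state)
    v = (1 - b2) * g ** 2             # second moment (from zero state)
    return w - lr * m / (np.sqrt(v) + eps)

w = np.array([1.0, -2.0])
g = np.array([0.1, 10.0])
lam, lr = 0.1, 1e-3

# (a) L2: decay folded into the gradient, then rescaled by the preconditioner.
w_l2 = adam_step(w, g + lam * w, lr=lr)
# (b) Decoupled: decay applied directly to the weights, bypassing it.
w_wd = adam_step(w, g, lr=lr) - lr * lam * w
# The two updates differ whenever the preconditioner is non-trivial.
```

This difference is one reason the paper analyzes the mechanisms separately per optimizer rather than treating weight decay as a single regularizer.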

Deformable Convolutional Networks
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard backpropagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code will be released.
03/17/2017 ∙ by Jifeng Dai, et al.
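The sampling idea can be sketched in a few lines: each kernel tap samples the feature map at its regular grid position plus a learned fractional (dy, dx) offset, using bilinear interpolation so the offsets stay differentiable. The single-channel, single-output-pixel version below is a hypothetical illustration, not the paper's CUDA operator.

```python
import numpy as np

# Core idea of deformable convolution: sample the feature map at regular
# grid locations plus learned fractional offsets, via bilinear
# interpolation, then take the kernel-weighted sum.

def bilinear(feat, y, x):
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

rng = np.random.default_rng(0)
feat = rng.normal(size=(7, 7))               # single-channel feature map
kernel = rng.normal(size=(3, 3))             # 3x3 convolution weights
offsets = 0.3 * rng.normal(size=(3, 3, 2))   # learned (dy, dx) per tap

cy = cx = 3                                   # one output location (center)
out = 0.0
for i in range(3):
    for j in range(3):
        dy, dx = offsets[i, j]
        y = np.clip(cy + i - 1 + dy, 0, 6)
        x = np.clip(cx + j - 1 + dx, 0, 6)
        out += kernel[i, j] * bilinear(feat, y, x)
# With all offsets zero this reduces to an ordinary 3x3 convolution at (cy, cx).
```

In the full module the offsets are themselves produced by a small convolution over the input, which is how they are learned from the target task without extra supervision.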

Noisy Natural Gradient as Variational Inference
Combining the flexibility of deep learning with Bayesian uncertainty estimation has long been a goal in our field, and many modern approaches are based on variational Bayes. Unfortunately, one is forced to choose between overly simplistic variational families (e.g. fully factorized) or expensive and complicated inference procedures. We show that natural gradient ascent with adaptive weight noise can be interpreted as fitting a variational posterior to maximize the evidence lower bound (ELBO). This insight allows us to train full-covariance, fully factorized, and matrix-variate Gaussian variational posteriors using noisy versions of natural gradient, Adam, and K-FAC, respectively. On standard regression benchmarks, our noisy K-FAC algorithm makes better predictions and matches HMC's predictive variances better than existing methods. Its improved uncertainty estimates lead to more efficient exploration in the settings of active learning and intrinsic motivation for reinforcement learning.
12/06/2017 ∙ by Guodong Zhang, et al.
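The "adaptive weight noise" idea can be sketched for the simplest case: a factorized Gaussian posterior over the weights of a linear-Gaussian model. Each step samples weights from the current posterior (the "noise"), then takes a Fisher-preconditioned gradient step on the mean. This toy version, with my own sizes and hyperparameters, only illustrates the mechanism, not the full noisy K-FAC algorithm.

```python
import numpy as np

# Illustrative sketch of noisy natural gradient for a factorized Gaussian
# posterior over linear-regression weights: sample weight noise from the
# current precision, then take a natural gradient step on the sample.

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

mu = np.zeros(d)
f = np.ones(d)                      # running diagonal Fisher estimate
lr, beta = 0.1, 0.95
for _ in range(500):
    w = mu + rng.normal(size=d) / np.sqrt(N * f)   # weight noise ~ posterior
    g = X.T @ (X @ w - y) / N                       # data-term gradient
    f = beta * f + (1 - beta) * (X ** 2).mean(0)    # update Fisher diagonal
    mu -= lr * g / f                                # natural gradient step
```

The same template with a Kronecker-factored Fisher in place of the diagonal one is what yields the matrix-variate Gaussian posterior of noisy K-FAC.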

Interplay Between Optimization and Generalization of Stochastic Gradient Descent with Covariance Noise
The choice of batch size in a stochastic optimization algorithm plays a substantial role in both optimization and generalization. Increasing the batch size typically improves optimization but degrades generalization. To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients. We demonstrate that the optimization performance of our method is more accurately captured by the structure of the noise covariance matrix than by the variance of the gradients. Moreover, for the convex quadratic case, we prove that the optimization performance can be characterized by the Frobenius norm of the noise covariance matrix. Our empirical studies with standard deep learning model architectures and datasets show that our method not only improves generalization performance in large-batch training, but does so while keeping optimization performance desirable and without lengthening training time.
02/21/2019 ∙ by Yeming Wen, et al.
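A minimal sketch of the idea: estimate the gradient covariance from per-example gradients and add a Gaussian noise sample with that covariance to the large-batch gradient, so large-batch training mimics the structured noise of small batches. The helper name and toy data below are illustrative assumptions.

```python
import numpy as np

# Sketch of adding covariance-structured noise to a large-batch gradient.
# The noise is drawn with (an estimate of) the per-example gradient
# covariance C, i.e. the update uses g + N(0, C).

def noisy_gradient(per_example_grads, rng):
    g = per_example_grads.mean(0)                 # large-batch gradient
    C = np.cov(per_example_grads, rowvar=False)   # gradient covariance estimate
    L = np.linalg.cholesky(C + 1e-8 * np.eye(C.shape[1]))
    return g + L @ rng.normal(size=C.shape[1])    # sample noise with cov C

rng = np.random.default_rng(0)
# Toy per-example gradients: noise around a shared mean direction.
grads = rng.normal(size=(512, 4)) + np.array([1.0, -0.5, 0.0, 2.0])
g_noisy = noisy_gradient(grads, rng)
# Per the paper's convex-quadratic analysis, the Frobenius norm of the
# noise covariance governs the optimization cost of the added noise.
```

In practice the covariance would be estimated cheaply (e.g. from a low-rank or diagonal approximation) rather than from full per-example gradients as in this toy version.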