
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with the squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable under small perturbations around the initialization. For two-layer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do in fact hold throughout training, under the assumptions of non-degenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.
05/27/2019 ∙ by Guodong Zhang, et al.
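The setting above can be made concrete with a small numerical sketch: damped natural gradient descent in its Gauss-Newton flavor (F = JᵀJ for the squared-error loss) on a tiny two-layer ReLU network. The network sizes, damping, learning rate, and finite-difference Jacobian below are illustrative choices for a minimal sketch, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer ReLU network: f(x) = w2 . relu(W1 @ x).
# Dimensions are illustrative: 2 inputs, 8 hidden units, 1 output, 6 cases.
n, d_in, d_h = 6, 2, 8
X = rng.normal(size=(n, d_in))
y = rng.normal(size=n)

def unpack(theta):
    W1 = theta[: d_h * d_in].reshape(d_h, d_in)
    w2 = theta[d_h * d_in:]
    return W1, w2

def f(theta):
    W1, w2 = unpack(theta)
    return np.maximum(X @ W1.T, 0.0) @ w2   # outputs on all n training cases

def jacobian(theta, eps=1e-6):
    # Finite-difference Jacobian of the n outputs w.r.t. all parameters.
    J = np.zeros((n, theta.size))
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        J[:, j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return J

theta = rng.normal(size=d_h * d_in + d_h) * 0.5
lam, lr = 1e-3, 0.5
for _ in range(40):
    r = f(theta) - y                        # residuals
    J = jacobian(theta)
    F = J.T @ J + lam * np.eye(theta.size)  # damped Gauss-Newton / Fisher
    theta -= lr * np.linalg.solve(F, J.T @ r)

print(0.5 * np.sum((f(theta) - y) ** 2))    # squared-error loss after NGD
```

With enough hidden units the Jacobian has full row rank at random initialization, and the damped natural gradient iteration drives the squared-error loss rapidly toward zero, in line with the conditions identified in the paper.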

Functional Variational Bayesian Neural Networks
Variational Bayesian neural networks (BNNs) perform variational inference over weights, but it is difficult to specify meaningful priors and approximate posteriors in a high-dimensional weight space. We introduce functional variational Bayesian neural networks (fBNNs), which maximize an Evidence Lower BOund (ELBO) defined directly on stochastic processes, i.e., distributions over functions. We prove that the KL divergence between stochastic processes equals the supremum of marginal KL divergences over all finite sets of inputs. Based on this, we introduce a practical training objective which approximates the functional ELBO using finite measurement sets and the spectral Stein gradient estimator. With fBNNs, we can specify priors entailing rich structures, including Gaussian processes and implicit stochastic processes. Empirically, we find fBNNs extrapolate well using various structured priors, provide reliable uncertainty estimates, and scale to large datasets.
03/14/2019 ∙ by Shengyang Sun, et al.
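The central identity stated in the abstract, with $P$ and $Q$ stochastic processes over an index set $\mathcal{X}$ and $P_{\mathbf{X}}$, $Q_{\mathbf{X}}$ their marginal distributions at a finite set of inputs $\mathbf{X}$, can be written as:

```latex
\mathrm{KL}\!\left(P \,\|\, Q\right)
  \;=\; \sup_{n \in \mathbb{N},\; \mathbf{X} \in \mathcal{X}^{n}}
        \mathrm{KL}\!\left(P_{\mathbf{X}} \,\|\, Q_{\mathbf{X}}\right)
```

This is what licenses the practical objective: the intractable functional KL can be approached from below by evaluating marginal KL divergences on finite measurement sets.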

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis
Reducing the test-time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices. To achieve this goal, we introduce a novel network reparameterization based on the Kronecker-factored eigenbasis (KFE), and then apply Hessian-based structured pruning methods in this basis. As opposed to existing Hessian-based pruning algorithms, which prune in parameter coordinates, our method works in the KFE, where different weights are approximately independent, enabling accurate pruning and fast computation. We demonstrate empirically the effectiveness of the proposed method through extensive experiments. In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. With negligible loss of accuracy, an iterative-pruning version gives a 10× reduction in model size and an 8× reduction in FLOPs on wide ResNet32.
05/15/2019 ∙ by Chaoqi Wang, et al.
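The key move, pruning in a basis where weights are approximately independent, can be sketched in a few lines. The following is a hypothetical NumPy illustration of second-order saliency pruning in the Kronecker-factored eigenbasis for one linear layer; the Kronecker factors are random stand-ins for the K-FAC statistics, and the saliency formula is the standard diagonal second-order estimate, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 6, 4
W = rng.normal(size=(d_out, d_in))

# Random PSD stand-ins for the K-FAC factors: A ~ input second moments,
# G ~ backpropagated-gradient second moments (illustrative, not fitted).
Xa = rng.normal(size=(50, d_in))
Xg = rng.normal(size=(50, d_out))
A = Xa.T @ Xa / 50
G = Xg.T @ Xg / 50

lam_A, Q_A = np.linalg.eigh(A)
lam_G, Q_G = np.linalg.eigh(G)

# Rotate the weights into the Kronecker-factored eigenbasis (KFE).
W_kfe = Q_G.T @ W @ Q_A

# Diagonal second-order saliency per entry: (1/2) * w'^2 * lam_G * lam_A.
saliency = 0.5 * (W_kfe ** 2) * np.outer(lam_G, lam_A)

# Zero out the half of the entries with the smallest saliency, rotate back.
k = W.size // 2
idx = np.argsort(saliency, axis=None)[:k]
W_pruned_kfe = W_kfe.copy()
W_pruned_kfe.flat[idx] = 0.0
W_pruned = Q_G @ W_pruned_kfe @ Q_A.T

# The pruned layer is sparse in the KFE, not in parameter coordinates.
print(np.count_nonzero(np.abs(Q_G.T @ W_pruned @ Q_A) > 1e-12))
```

Because the Fisher is approximately diagonal in the KFE, zeroing entries there accounts for the loss increase more accurately than pruning raw weights.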

Eigenvalue Corrected Noisy Natural Gradient
Variational Bayesian neural networks combine the flexibility of deep learning with Bayesian uncertainty estimation. However, inference procedures for flexible variational posteriors are computationally expensive. A recently proposed method, noisy natural gradient, is a surprisingly simple way to fit expressive posteriors by adding weight noise to regular natural gradient updates. Noisy K-FAC is an instance of noisy natural gradient that fits a matrix-variate Gaussian posterior with minor changes to ordinary K-FAC. Nevertheless, a matrix-variate Gaussian posterior does not capture an accurate diagonal variance. In this work, we extend noisy K-FAC to obtain a more flexible posterior distribution called eigenvalue corrected matrix-variate Gaussian. The proposed method computes the full diagonal rescaling factor in the Kronecker-factored eigenbasis. Empirically, our approach consistently outperforms existing algorithms (e.g., noisy K-FAC) on regression and classification tasks.
11/30/2018 ∙ by Juhan Bae, et al.
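The eigenvalue correction can be illustrated directly: in a fixed Kronecker-factored eigenbasis, fitting the full diagonal rescaling (rather than the Kronecker product of the factor eigenvalues) can never increase the Frobenius approximation error, since the exact diagonal is the Frobenius-optimal diagonal in that basis. The sketch below checks this on a random PSD matrix; the factor choices are arbitrary stand-ins, as the comparison holds for any fixed eigenbasis.

```python
import numpy as np

rng = np.random.default_rng(2)
da, dg = 3, 2

# A "true" Fisher-like PSD matrix F of size (da*dg) x (da*dg).
M = rng.normal(size=(da * dg, da * dg))
F = M @ M.T

# Random PSD stand-ins for the Kronecker factors (illustrative).
Ma = rng.normal(size=(da, da))
Mg = rng.normal(size=(dg, dg))
A, G = Ma @ Ma.T, Mg @ Mg.T

lam_A, Q_A = np.linalg.eigh(A)
lam_G, Q_G = np.linalg.eigh(G)
Q = np.kron(Q_A, Q_G)                 # Kronecker-factored eigenbasis

# K-FAC diagonal in this basis: Kronecker product of factor eigenvalues.
s_kfac = np.kron(lam_A, lam_G)
# Eigenvalue-corrected diagonal: the exact diagonal of F in the same basis.
s_ekfac = np.diag(Q.T @ F @ Q)

err_kfac = np.linalg.norm(F - Q @ np.diag(s_kfac) @ Q.T)
err_ekfac = np.linalg.norm(F - Q @ np.diag(s_ekfac) @ Q.T)
print(err_ekfac <= err_kfac + 1e-12)  # correction is never worse in this basis
```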

Differentiable Compositional Kernel Learning for Gaussian Processes
The generalization properties of Gaussian processes depend heavily on the choice of kernel, and this choice remains a dark art. We present the Neural Kernel Network (NKN), a flexible family of kernels represented by a neural network. The NKN architecture is based on the composition rules for kernels, so that each unit of the network corresponds to a valid kernel. It can compactly approximate compositional kernel structures such as those used by the Automatic Statistician (Lloyd et al., 2014), but because the architecture is differentiable, it is end-to-end trainable with gradient-based optimization. We show that the NKN is universal for the class of stationary kernels. Empirically, we demonstrate pattern discovery and extrapolation abilities of NKN on several tasks that depend crucially on identifying the underlying structure, including time series and texture extrapolation, as well as Bayesian optimization.
06/12/2018 ∙ by Shengyang Sun, et al.
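The composition rules underlying the NKN can be sketched directly: non-negative sums and elementwise products of valid kernels are valid kernels, so any unit built from them yields a positive semidefinite Gram matrix. The kernels and mixing weights below are illustrative, not a trained NKN.

```python
import numpy as np

def rbf(x1, x2, ls):
    """Squared-exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def periodic(x1, x2, p, ls):
    """Standard periodic kernel with period p."""
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / p) ** 2 / ls ** 2)

x = np.linspace(-3.0, 3.0, 40)

# One NKN-style "unit": a non-negative combination of a product of kernels
# and a base kernel (weights 0.7 / 0.3 are arbitrary for this sketch).
K = 0.7 * rbf(x, x, 1.0) * periodic(x, x, 2.0, 0.5) + 0.3 * rbf(x, x, 0.3)

# Closure under sums (any non-negative weights) and elementwise products
# (Schur product theorem) guarantees K is positive semidefinite.
print(np.linalg.eigvalsh(K).min() >= -1e-8)
```

Because each weight enters the output through differentiable sums and products, such a network of kernels can be trained end-to-end by gradient-based marginal-likelihood optimization.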

Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of L_2 regularization. Literal weight decay has been shown to outperform L_2 regularization for optimizers for which they differ. We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the input-output Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks.
10/29/2018 ∙ by Guodong Zhang, et al.
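The distinction between literal weight decay and L_2 regularization, which drives the "optimizers for which they differ" caveat, shows up whenever the optimizer preconditions its gradients. A minimal one-step sketch (the diagonal preconditioner values stand in for Adam-like scaling and are illustrative):

```python
import numpy as np

w = np.array([1.0, 1.0])      # current weights
g = np.array([0.5, 0.5])      # gradient of the unregularized loss
p = np.array([4.0, 0.25])     # Adam-like diagonal preconditioner (illustrative)
lr, wd = 0.1, 0.01

# L_2 regularization: the decay term joins the gradient and gets preconditioned.
w_l2 = w - lr * (g + wd * w) / p
# Literal (decoupled) weight decay: the decay is applied outside the preconditioner.
w_decay = w - lr * g / p - lr * wd * w

print(np.allclose(w_l2, w_decay))   # False: the two rules disagree when p != 1
# For plain SGD (p = 1), the two updates coincide exactly:
print(np.allclose(w - lr * (g + wd * w), w - lr * g - lr * wd * w))
```

Under SGD the two are equivalent up to a rescaled coefficient, which is why the difference only matters for adaptive and second-order optimizers such as Adam and K-FAC.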

Deformable Convolutional Networks
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard backpropagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code will be released.
03/17/2017 ∙ by Jifeng Dai, et al.
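The core operation, sampling the feature map at each kernel tap's regular grid position plus a learned fractional offset via bilinear interpolation, can be sketched in NumPy for a single 3×3 output location. The offsets here are random stand-ins for the ones a real deformable layer would predict from the input.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinear sample of a 2-D feature map at fractional location (y, x)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    wy, wx = y - np.floor(y), x - np.floor(x)
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

rng = np.random.default_rng(3)
feat = rng.normal(size=(8, 8))               # single-channel feature map
kernel = rng.normal(size=(3, 3))             # 3x3 convolution weights
offsets = 0.3 * rng.normal(size=(3, 3, 2))   # (dy, dx) per tap; normally predicted
cy = cx = 4                                  # center of the receptive field

# Deformable convolution at one output location: each tap samples at its
# regular grid position PLUS its fractional offset, then weights as usual.
out = 0.0
for i in range(3):
    for j in range(3):
        dy, dx = offsets[i, j]
        out += kernel[i, j] * bilinear(feat, cy + i - 1 + dy, cx + j - 1 + dx)
print(out)
```

With all offsets zero this reduces exactly to a plain convolution, which is why the modules can drop in for their standard counterparts and be trained end-to-end (the bilinear weights are differentiable in the offsets).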

Noisy Natural Gradient as Variational Inference
Combining the flexibility of deep learning with Bayesian uncertainty estimation has long been a goal in our field, and many modern approaches are based on variational Bayes. Unfortunately, one is forced to choose between overly simplistic variational families (e.g. fully factorized) or expensive and complicated inference procedures. We show that natural gradient ascent with adaptive weight noise can be interpreted as fitting a variational posterior to maximize the evidence lower bound (ELBO). This insight allows us to train full covariance, fully factorized, and matrix variate Gaussian variational posteriors using noisy versions of natural gradient, Adam, and K-FAC, respectively. On standard regression benchmarks, our noisy K-FAC algorithm makes better predictions and matches HMC's predictive variances better than existing methods. Its improved uncertainty estimates lead to more efficient exploration in the settings of active learning and intrinsic motivation for reinforcement learning.
12/06/2017 ∙ by Guodong Zhang, et al.
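A minimal sketch of the full-covariance idea on Bayesian linear regression: maintain a running Fisher estimate, sample weights with covariance proportional to its inverse (the "adaptive weight noise"), and take natural-gradient steps on the posterior mean. All hyperparameters and the problem itself are illustrative, not the paper's algorithm or experiments.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Noisy natural gradient, full-covariance flavor (hypothetical sketch):
# the variational posterior is q = N(mu, (N * F)^-1), so sampling weights
# adds noise whose scale adapts to the running Fisher estimate F.
mu = np.zeros(d)
F = np.eye(d)
lr, beta = 0.1, 0.95
for _ in range(300):
    L = np.linalg.cholesky(np.linalg.inv(N * F))
    w = mu + L @ rng.normal(size=d)             # noisy (sampled) weights
    grad = X.T @ (X @ w - y) / N                # gradient at the sample
    F = beta * F + (1 - beta) * (X.T @ X / N)   # Fisher moving average
    mu -= lr * np.linalg.solve(F, grad)         # natural-gradient step on mu

print(np.linalg.norm(mu - w_true))              # mean lands near the truth
```

The point of the interpretation is that this weight-noise-plus-natural-gradient loop is implicitly doing variational inference: the stationary mean tracks the posterior mode while the injected noise realizes the posterior covariance.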

Interplay Between Optimization and Generalization of Stochastic Gradient Descent with Covariance Noise
The choice of batch size in a stochastic optimization algorithm plays a substantial role in both optimization and generalization. Increasing the batch size typically improves optimization but degrades generalization. To address the problem of improving generalization while maintaining optimal convergence in large-batch training, we propose to add covariance noise to the gradients. We demonstrate that the optimization performance of our method is more accurately captured by the structure of the noise covariance matrix than by the variance of the gradients. Moreover, for convex quadratic objectives, we prove that it can be characterized by the Frobenius norm of the noise matrix. Our empirical studies with standard deep learning model architectures and datasets show that our method not only improves generalization performance in large-batch training, but does so without sacrificing optimization performance or lengthening training.
02/21/2019 ∙ by Yeming Wen, et al.
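The mechanism itself is simple to sketch: draw noise with a prescribed covariance C (via a Cholesky factor) and add it to the large-batch gradient. In the paper C is shaped like small-batch gradient noise; below it is an arbitrary PSD stand-in, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
g_large = rng.normal(size=d)          # stand-in for a large-batch gradient

# Target noise covariance C (arbitrary PSD stand-in for the structured
# small-batch gradient-noise covariance used in the paper).
M = rng.normal(size=(d, d))
C = M @ M.T / d

# Sample eps ~ N(0, C) as L @ z with L the Cholesky factor of C,
# and add it to the gradient before the parameter update.
L = np.linalg.cholesky(C)
samples = np.array([g_large + L @ rng.normal(size=d) for _ in range(20000)])

# The injected noise has the covariance structure we asked for.
emp_cov = np.cov(samples.T)
print(np.linalg.norm(emp_cov - C, ord="fro") / np.linalg.norm(C, ord="fro"))
```

This is why the analysis is phrased in terms of the noise covariance matrix: the update distribution is controlled entirely by C (and, for convex quadratics, summarized by its Frobenius norm), not merely by the scalar gradient variance.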

Computational Design of Finite Strain Auxetic Metamaterials via Topology Optimization and Nonlinear Homogenization
A novel computational framework for designing metamaterials with negative Poisson's ratio over a large strain range is presented in this work by combining density-based topology optimization with a mixed stress/deformation driven nonlinear homogenization method. A measure of Poisson's ratio based on the macro deformations is proposed, which is further validated through direct numerical simulations. With the consistent optimization formulations based on nonlinear homogenization, auxetic metamaterial designs with respect to different loading orientations and with different unit cell domains are systematically explored. In addition, the extension to multimaterial auxetic metamaterial designs is also considered, and stable optimization formulations are presented to obtain discrete metamaterial topologies under finite strains. Various new auxetic designs are obtained based on the proposed framework. To validate the performance of optimized designs, a multiscale stability analysis is carried out using the Bloch wave method and rank-one convexity check. As demonstrated, short and/or long wavelength instabilities can occur during the loading process, leading to a change of periodicity of the microstructure, which can affect the performance of an optimized design.
04/25/2019 ∙ by Guodong Zhang, et al.
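For readers outside mechanics, a minimal illustration of the sign convention behind a negative Poisson's ratio (the paper's measure is defined from macro deformations at finite strain; this sketch only captures the small-strain intuition):

```python
# Small-strain Poisson's ratio: nu = -eps_transverse / eps_axial.
def poissons_ratio(eps_axial, eps_transverse):
    return -eps_transverse / eps_axial

# Conventional material: axial stretch contracts the transverse direction,
# so the ratio comes out positive.
print(poissons_ratio(0.01, -0.003))
# Auxetic metamaterial: axial stretch ALSO expands the transverse
# direction, so the ratio comes out negative.
print(poissons_ratio(0.01, 0.004))
```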