
Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes
There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multilayer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state-of-the-art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance in finite-channel CNNs trained with stochastic gradient descent (SGD) has no corresponding property in the Bayesian treatment of the infinite channel limit, a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that, while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.
10/11/2018 ∙ by Roman Novak, et al.
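
The Monte Carlo kernel estimate mentioned in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation: for brevity it uses a fully connected ReLU network rather than a CNN, and the function name `mc_nngp_kernel`, the hyperparameters `sigma_w2` and `sigma_b2`, and the default widths and sample counts are all assumptions made for illustration. The idea shown is only that the GP covariance can be approximated by averaging over randomly initialized finite networks.

```python
import numpy as np

def mc_nngp_kernel(X, depth=2, width=256, num_samples=64,
                   sigma_w2=2.0, sigma_b2=0.1, rng=None):
    """Monte Carlo estimate of the output covariance of a random ReLU network.

    X has shape (n_points, n_features). All names and defaults here are
    illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    K = np.zeros((n, n))
    for _ in range(num_samples):
        h, fan_in = X, d
        for _ in range(depth):
            # i.i.d. Gaussian prior over weights and biases
            W = rng.standard_normal((fan_in, width)) * np.sqrt(sigma_w2 / fan_in)
            b = rng.standard_normal(width) * np.sqrt(sigma_b2)
            h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layer
            fan_in = width
        # Covariance of a linear readout placed on top of the last hidden layer
        K += sigma_b2 + sigma_w2 * (h @ h.T) / width
    return K / num_samples
```

Increasing `width` and `num_samples` should bring the estimate closer to the infinite-width kernel, at the cost of more computation.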

Adversarial Reprogramming of Neural Networks
Deep neural networks are susceptible to adversarial attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as identifying a panda as a gibbon or confusing a cat with a computer. Previous adversarial examples have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduce adversarial attacks that instead reprogram the target model to perform a task chosen by the attacker, without the attacker needing to specify or compute the desired output for each test-time input. This attack is accomplished by optimizing for a single adversarial perturbation, of unrestricted magnitude, that can be added to all test-time inputs to a machine learning model in order to cause the model to perform a task chosen by the adversary when processing these inputs, even if the model was not trained to do this task. These perturbations can thus be considered a program for the new task. We demonstrate adversarial reprogramming on six ImageNet classification models, repurposing these models to perform a counting task, as well as two classification tasks: classification of MNIST and CIFAR10 examples presented within the input to the ImageNet model.
06/28/2018 ∙ by Gamaleldin F. Elsayed, et al.

Learned optimizers that outperform SGD on wall-clock and test loss
Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks. Analogously, this suggests that learned optimizers may similarly outperform current hand-designed optimizers, especially for specific problems. However, learned optimizers are notoriously difficult to train and have yet to demonstrate wall-clock speedups over hand-designed optimizers, and thus are rarely used in practice. Typically, learned optimizers are trained by truncated backpropagation through an unrolled optimization process. The resulting gradients are either strongly biased (for short truncations) or have exploding norm (for long truncations). In this work we propose a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance. This allows us to train neural networks to perform optimization of a specific task faster than well-tuned first-order methods. Moreover, by training the optimizer against validation loss (as opposed to training loss), we are able to learn optimizers that train networks to better generalization than first-order methods. We demonstrate these results on problems where our learned optimizer trains convolutional networks in a fifth of the wall-clock time compared to tuned first-order methods, and with an improvement in test loss.
10/24/2018 ∙ by Luke Metz, et al.

A RAD approach to deep mixture models
Flow based models such as Real NVP are an extremely powerful approach to density estimation. However, existing flow based models are restricted to transforming continuous densities over a continuous input space into similarly continuous distributions over continuous latent variables. This makes them poorly suited for modeling and representing discrete structures in data distributions, for example class membership or discrete symmetries. To address this difficulty, we present a normalizing flow architecture which relies on domain partitioning using locally invertible functions, and possesses both real and discrete valued latent variables. This Real and Discrete (RAD) approach retains the desirable normalizing flow properties of exact sampling, exact inference, and analytically computable probabilities, while at the same time allowing simultaneous modeling of both continuous and discrete structure in a data distribution.
03/18/2019 ∙ by Laurent Dinh, et al.

Guided evolutionary strategies: escaping the curse of dimensionality in random search
Many applications in machine learning require optimizing a function whose true gradient is unknown, but where surrogate gradient information (directions that may be correlated with, but not necessarily identical to, the true gradient) is available instead. This arises when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in certain reinforcement learning applications, or when using synthetic gradients). We propose Guided Evolutionary Strategies, a method for optimally using surrogate gradient directions along with random search. We define a search distribution for evolutionary strategies that is elongated along a guiding subspace spanned by the surrogate gradients. This allows us to estimate a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace, and we use this to derive a setting of the hyperparameters that works well across problems. Finally, we apply our method to example problems including truncated unrolled optimization and a synthetic gradient problem, demonstrating improvement over both standard evolutionary strategies and first-order methods that directly follow the surrogate gradient. We provide a demo of Guided ES at: https://github.com/brainresearch/guidedevolutionarystrategies.
06/26/2018 ∙ by Niru Maheswaranathan, et al.
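
As a rough illustration of a search distribution "elongated along a guiding subspace", the sketch below mixes isotropic Gaussian noise with noise restricted to the span of the surrogate gradients, then forms an antithetic finite-difference estimate of a descent direction. The covariance parameterization, the function name `guided_es_gradient`, and the parameters `sigma`, `alpha`, `beta`, and `num_pairs` are assumptions chosen for this example and should not be read as the authors' exact method; the demo repository linked in the abstract is the authoritative reference.

```python
import numpy as np

def guided_es_gradient(f, x, surrogate_grads, sigma=0.1, alpha=0.5,
                       beta=1.0, num_pairs=50, rng=None):
    """Estimate a descent direction for f at x using guided random search.

    surrogate_grads has shape (k, n): k surrogate gradient directions in
    R^n. alpha in (0, 1] trades off isotropic search (alpha = 1) against
    search inside the guiding subspace (alpha near 0). Illustrative only.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    # Orthonormal basis U (n x k) for the span of the surrogate gradients
    U, _ = np.linalg.qr(surrogate_grads.T)
    k = U.shape[1]
    grad = np.zeros(n)
    for _ in range(num_pairs):
        # Perturbation with covariance sigma^2 * (alpha/n * I + (1-alpha)/k * U U^T)
        eps = sigma * (np.sqrt(alpha / n) * rng.standard_normal(n)
                       + np.sqrt((1.0 - alpha) / k) * (U @ rng.standard_normal(k)))
        # Antithetic pair: evaluate f at x + eps and x - eps
        grad += eps * (f(x + eps) - f(x - eps))
    return beta / (2.0 * sigma**2 * num_pairs) * grad
```

The returned vector can be handed to any first-order optimizer in place of a true gradient, which matches the workflow the abstract describes.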

Learned optimizers that outperform SGD on wall-clock and validation loss
Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks. Analogously, this suggests that learned optimizers may similarly outperform current hand-designed optimizers, especially for specific problems. However, learned optimizers are notoriously difficult to train and have yet to demonstrate wall-clock speedups over hand-designed optimizers, and thus are rarely used in practice. Typically, learned optimizers are trained by truncated backpropagation through an unrolled optimization process. The resulting gradients are either strongly biased (for short truncations) or have exploding norm (for long truncations). In this work we propose a training scheme which overcomes both of these difficulties, by dynamically weighting two unbiased gradient estimators for a variational loss on optimizer performance. This allows us to train neural networks to perform optimization of a specific task faster than well-tuned first-order methods. Moreover, by training the optimizer against validation loss (as opposed to training loss), we are able to learn optimizers that train networks to better generalization than first-order methods. We demonstrate these results on problems where our learned optimizer trains convolutional networks in a fifth of the wall-clock time compared to tuned first-order methods, and with an improvement in validation loss.
10/24/2018 ∙ by Luke Metz, et al.

Deep Neural Networks as Gaussian Processes
A deep fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP) in the limit of infinite network width. This correspondence enables exact Bayesian inference for neural networks on regression tasks by means of straightforward matrix computations. For single hidden-layer networks, the covariance function of this GP has long been known. Recently, kernel functions for multilayer random neural networks have been developed, but only outside of a Bayesian framework. As such, previous work has not identified the correspondence between using these kernels as the covariance function for a GP and performing fully Bayesian prediction with a deep neural network. In this work, we derive this correspondence and develop a computationally efficient pipeline to compute the covariance functions. We then use the resulting GP to perform Bayesian inference for deep neural networks on MNIST and CIFAR10. We find that the GP-based predictions are competitive and can outperform neural networks trained with stochastic gradient descent. We observe that the trained neural network accuracy approaches that of the corresponding GP-based computation with increasing layer width, and that the GP uncertainty is strongly correlated with prediction error. We connect our observations to the recent development of signal propagation in random neural networks.
11/01/2017 ∙ by Jaehoon Lee, et al.
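
The "straightforward matrix computations" mentioned in the abstract above can be sketched for one specific case. The example assumes a ReLU nonlinearity, for which the layer-to-layer covariance recursion has the well-known arc-cosine closed form, and it computes only the GP posterior mean for regression. The function names, the hyperparameters `sigma_w2` and `sigma_b2`, the `depth` default, and the `noise` jitter are illustrative assumptions and are not taken from the paper's pipeline.

```python
import numpy as np

def relu_nngp_kernel(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.1):
    """Covariance between rows of X1 and X2 for an infinitely wide ReLU network."""
    d = X1.shape[1]
    # Covariance of the first-layer pre-activations
    k12 = sigma_b2 + sigma_w2 * (X1 @ X2.T) / d
    k11 = sigma_b2 + sigma_w2 * np.sum(X1 * X1, axis=1) / d
    k22 = sigma_b2 + sigma_w2 * np.sum(X2 * X2, axis=1) / d
    for _ in range(depth):
        norm = np.sqrt(np.outer(k11, k22))
        cos_theta = np.clip(k12 / norm, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # Arc-cosine (ReLU) expectation, then the next layer's weights and biases
        k12 = sigma_b2 + sigma_w2 / (2.0 * np.pi) * norm * (
            np.sin(theta) + (np.pi - theta) * cos_theta)
        # On the diagonal theta = 0, so the update reduces to a scaling by 1/2
        k11 = sigma_b2 + sigma_w2 / 2.0 * k11
        k22 = sigma_b2 + sigma_w2 / 2.0 * k22
    return k12

def gp_posterior_mean(X_train, y_train, X_test, noise=1e-6, **kernel_kwargs):
    """GP regression mean: K(test, train) @ (K(train, train) + noise * I)^-1 @ y."""
    K = relu_nngp_kernel(X_train, X_train, **kernel_kwargs)
    K_star = relu_nngp_kernel(X_test, X_train, **kernel_kwargs)
    return K_star @ np.linalg.solve(K + noise * np.eye(len(X_train)), y_train)
```

With a very small `noise` value the posterior mean approximately interpolates the training targets, which is a quick sanity check on the kernel.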

A Correspondence Between Random Neural Networks and Statistical Field Theory
A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad hoc. In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics. We argue that several previous investigations of stochastic networks actually studied a particular factorial approximation to the full lattice model. For random linear networks and random rectified linear networks we show that the corresponding lattice models in the wide network limit may be systematically approximated by a Gaussian distribution with covariance between the layers of the network. In each case, the approximate distribution can be diagonalized by Fourier transformation. We show that this approximation accurately describes the results of numerical simulations of wide random neural networks. Finally, we demonstrate that in each case the large scale behavior of the random networks can be approximated by an effective field theory.
10/18/2017 ∙ by Samuel S. Schoenholz, et al.

Survey of Expressivity in Deep Neural Networks
We survey results on neural network expressivity described in "On the Expressive Power of Deep Neural Networks". The paper motivates and develops three natural measures of expressiveness, which all display an exponential dependence on the depth of the network. In fact, all of these measures are related to a fourth quantity, trajectory length. This quantity grows exponentially in the depth of the network, and is responsible for the depth sensitivity observed. These results translate to consequences for networks during and after training. They suggest that parameters earlier in a network have greater influence on its expressive power: in particular, given a layer, its influence on expressivity is determined by the remaining depth of the network after that layer. This is verified with experiments on MNIST and CIFAR10. We also explore the effect of training on the input-output map, and find that it trades off between stability and expressivity.
11/24/2016 ∙ by Maithra Raghu, et al.

SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less. Code: https://github.com/google/svcca/
06/19/2017 ∙ by Maithra Raghu, et al.
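
The two stages named in "Singular Vector Canonical Correlation Analysis" can be sketched as follows: an SVD keeps the directions that explain most of each layer's variance, and canonical correlation analysis then compares the two reduced representations. This is a minimal sketch under assumptions, not the code from the linked repository: the function name `svcca`, the `keep_variance` threshold, and the (neurons, datapoints) layout of the activation matrices are choices made for this example.

```python
import numpy as np

def svcca(acts1, acts2, keep_variance=0.99):
    """Return canonical correlations between two sets of activations.

    acts1 and acts2 have shape (num_neurons, num_datapoints) and must be
    computed on the same datapoints. Illustrative sketch only.
    """
    def top_singular_directions(acts):
        # Center each neuron, then keep the leading singular directions
        acts = acts - acts.mean(axis=1, keepdims=True)
        _, s, Vt = np.linalg.svd(acts, full_matrices=False)
        explained = np.cumsum(s**2) / np.sum(s**2)
        k = min(int(np.searchsorted(explained, keep_variance)) + 1, len(s))
        return s[:k, None] * Vt[:k]  # shape (k, num_datapoints)

    X = top_singular_directions(acts1)
    Y = top_singular_directions(acts2)
    # CCA via orthonormal bases: the singular values of Qx^T Qy are the
    # canonical correlations between the two subspaces
    Qx, _ = np.linalg.qr(X.T)
    Qy, _ = np.linalg.qr(Y.T)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)
```

Because both stages depend only on the subspace spanned by the centered activations, applying an invertible affine map to one layer leaves the correlations unchanged when no directions are truncated, which is the invariance the abstract refers to.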

Capacity and Trainability in Recurrent Neural Networks
Two potential bottlenecks on the expressiveness of recurrent neural networks (RNNs) are their ability to store information about the task in their parameters, and to store information about the input history in their units. We show experimentally that all common RNN architectures achieve nearly the same per-task and per-unit capacity bounds with careful training, for a variety of tasks and stacking depths. They can store an amount of task information which is linear in the number of parameters, and is approximately 5 bits per parameter. They can additionally store approximately one real number from their input history per hidden unit. We further find that for several tasks it is the per-task parameter capacity bound that determines performance. These results suggest that many previous results comparing RNN architectures are driven primarily by differences in training effectiveness, rather than differences in capacity. Supporting this observation, we compare training difficulty for several architectures, and show that vanilla RNNs are far more difficult to train, yet have slightly higher capacity. Finally, we propose two novel RNN architectures, one of which is easier to train than the LSTM or GRU for deeply stacked architectures.
11/29/2016 ∙ by Jasmine Collins, et al.
Jascha Sohl-Dickstein
Staff Research Scientist in the Brain group at Google, Academic Resident at Khan Academy, Visiting scholar in Surya Ganguli's lab at Stanford University, PhD in 2012 in the Redwood Center for Theoretical Neuroscience at UC Berkeley, in Bruno Olshausen's lab.