
Gradient Descent Provably Optimizes Overparameterized Neural Networks
One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is nonconvex and nonsmooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU-activated neural networks. For an m-hidden-node shallow neural network with ReLU activation and n training data, we show that as long as m is large enough and the data is non-degenerate, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis is based on the following observation: overparameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first-order methods.
10/04/2018 ∙ by Simon S. Du, et al.
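The setting above can be sketched numerically: a width-m two-layer ReLU network with fixed ±1 output weights, first layer trained by full-batch gradient descent on the quadratic loss. The sizes, seed, and step size below are arbitrary illustration choices, not values from the paper; the sketch just exhibits the two phenomena the analysis rests on (geometric loss decay, weights staying near initialization).

```python
import numpy as np

# f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x), a_r fixed at +-1,
# only the first layer W is trained (illustration sizes, not the paper's).
rng = np.random.default_rng(0)
n, d, m = 8, 5, 2000                      # samples, input dim, hidden width
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = rng.uniform(-1.0, 1.0, size=n)

W = rng.normal(size=(m, d))               # random Gaussian initialization
a = rng.choice([-1.0, 1.0], size=m)       # fixed output layer
W0 = W.copy()
eta = 0.1

losses = []
for _ in range(500):
    pre = W @ X.T                         # (m, n) pre-activations
    f = (a @ np.maximum(pre, 0.0)) / np.sqrt(m)
    err = f - y
    losses.append(0.5 * np.sum(err ** 2))
    # dL/dw_r = (1/sqrt(m)) a_r * sum_i err_i * 1[w_r . x_i > 0] * x_i
    grad = ((a[:, None] * (pre > 0)) * err[None, :]) @ X / np.sqrt(m)
    W -= eta * grad

# the loss shrinks geometrically while every weight stays near its init
max_move = np.abs(W - W0).max()
```

Running this, the loss drops by orders of magnitude while `max_move` stays small, consistent with the observation that overparameterization keeps each weight vector close to its initialization.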

A Deep Reinforcement Learning Approach for Global Routing
Global routing has been a historically challenging problem in electronic circuit design, where the challenge is to connect a large and arbitrary number of circuit components with wires without violating the design rules for the printed circuit boards or integrated circuits. Similar routing problems also exist in the design of complex hydraulic systems, pipe systems, and logistic networks. Existing solutions typically consist of greedy algorithms and hard-coded heuristics, and as such suffer from a lack of model flexibility and non-optimal solutions. As an alternative, this work presents a deep reinforcement learning method for solving the global routing problem in a simulated environment. At the heart of the proposed method is a deep reinforcement learning agent that produces a routing policy across the variety of problems it is presented with, leveraging the conjoint optimization mechanism of deep reinforcement learning. The conjoint optimization mechanism is explained and demonstrated in detail; the best network structure and the parameters of the learned model are explored. Based on the fine-tuned model, routing solutions and rewards are presented and analyzed. The results indicate that the approach can outperform the benchmark method of sequential A* routing, suggesting promising potential for deep reinforcement learning in global routing and in other routing or path-planning problems in general. Another major contribution of this work is a global routing problem set generator able to produce parameterized problem sets with different sizes and constraints, enabling the evaluation of different routing algorithms and the generation of training datasets for future data-driven routing approaches.
06/20/2019 ∙ by Haiguang Liao, et al.
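The sequential A* benchmark mentioned above can be sketched on a toy 2-D grid: 2-pin nets are routed one after another, and cells occupied by earlier wires block later nets. The grid size and pin locations are made-up illustration values; with two interlocking nets the second becomes unroutable, which is exactly the kind of non-optimal greedy outcome the abstract refers to.

```python
import heapq

def astar(start, goal, blocked, size):
    """Shortest 4-neighbour path on a size x size grid avoiding blocked cells."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan heuristic
    frontier = [(h(start), 0, start, [start])]
    seen = set()
    while frontier:
        _, g, cur, path = heapq.heappop(frontier)
        if cur == goal:
            return path
        if cur in seen:
            continue
        seen.add(cur)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in blocked and nxt not in seen):
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None                                               # net is unroutable

def route_sequentially(nets, size):
    blocked, routes = set(), []
    for start, goal in nets:                 # greedy net order: as given
        path = astar(start, goal, blocked - {start, goal}, size)
        routes.append(path)
        if path:
            blocked |= set(path)             # earlier wires block later nets
    return routes

routes = route_sequentially([((0, 0), (7, 7)), ((0, 7), (7, 0))], size=8)
```

The first net gets a shortest path (15 cells for a Manhattan distance of 14), after which the second, crossing net cannot be routed at all.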

ProBO: a Framework for Using Probabilistic Programming in Bayesian Optimization
Optimizing an expensive-to-query function is a common task in science and engineering, where it is beneficial to keep the number of queries to a minimum. A popular strategy is Bayesian optimization (BO), which leverages probabilistic models for this task. Most BO today uses Gaussian processes (GPs), or a few other surrogate models. However, there is a broad set of Bayesian modeling techniques that we may want to use to capture complex systems and reduce the number of queries. Probabilistic programs (PPs) are modern tools that allow for flexible model composition, incorporation of prior information, and automatic inference. In this paper, we develop ProBO, a framework for BO using only standard operations common to most PPs. This allows a user to drop in an arbitrary PP implementation and use it directly in BO. To do this, we describe black-box versions of popular acquisition functions that can be used in our framework automatically, without model-specific derivation, and show how to optimize these functions. We also introduce a model, which we term the Bayesian Product of Experts, that integrates into ProBO and can be used to combine information from multiple models implemented with different PPs. We show empirical results using multiple PP implementations, and compare against standard BO methods.
01/31/2019 ∙ by Willie Neiswanger, et al.
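The idea behind a black-box acquisition function can be sketched as follows: expected improvement is estimated purely from posterior-predictive samples, so any model that can produce such samples can be dropped in. The Bayesian linear-regression "model" below is a stand-in for a probabilistic program, not the ProBO API, and all sizes and candidates are illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)

class BayesLinReg:
    """Conjugate Bayesian linear regression: w ~ N(0, I), y = x.w + N(0, s^2)."""
    def __init__(self, X, y, s=0.1):
        A = X.T @ X / s**2 + np.eye(X.shape[1])
        self.cov = np.linalg.inv(A)
        self.mean = self.cov @ X.T @ y / s**2
    def post_pred_samples(self, x, k):
        """k samples of f(x) under the posterior over weights."""
        ws = rng.multivariate_normal(self.mean, self.cov, size=k)
        return ws @ x

def black_box_ei(model, x, y_best, k=2000):
    """Monte-Carlo expected improvement (minimization) from posterior samples."""
    fs = model.post_pred_samples(x, k)
    return np.mean(np.maximum(y_best - fs, 0.0))

# toy data from a linear function, then score two candidate query points
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=20)
model = BayesLinReg(X, y)
x_good = -6 * w_true / np.linalg.norm(w_true)   # posterior predicts f far below y_best
x_poor = 6 * w_true / np.linalg.norm(w_true)    # posterior predicts f far above y_best
ei_good = black_box_ei(model, x_good, y.min())
ei_poor = black_box_ei(model, x_poor, y.min())
```

The acquisition only ever calls `post_pred_samples`, which is the point: swapping in a different probabilistic model requires no re-derivation.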

Competence-based Curriculum Learning for Neural Machine Translation
Current state-of-the-art NMT systems use large neural networks that are not only slow to train, but also often require many heuristics and optimization tricks, such as specialized learning-rate schedules and large batch sizes. This is undesirable as it requires extensive hyperparameter tuning. In this paper, we propose a curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. Our framework consists of a principled way of deciding which training samples are shown to the model at different times during training, based on the estimated difficulty of a sample and the current competence of the model. Filtering training samples in this manner prevents the model from getting stuck in bad local optima, making it converge faster and reach a better solution than the common approach of uniformly sampling training examples. Furthermore, the proposed method can be easily applied to existing NMT models by simply modifying their input data pipelines. We show that our framework can help improve the training time and the performance of both recurrent neural network models and Transformers, achieving up to a 70% decrease in training time, while at the same time obtaining accuracy improvements of up to 2.2 BLEU.
03/23/2019 ∙ by Emmanouil Antonios Platanios, et al.
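The difficulty/competence mechanism can be sketched in a few lines: each example gets a difficulty score (here simply sentence length, mapped to its CDF value), the model's competence grows from c0 to 1 over T steps, and at step t we sample uniformly among examples whose difficulty CDF is at most the current competence. The square-root schedule is one of the schedules such a framework can use; c0, T, and the tiny corpus are illustration values.

```python
import math, random

random.seed(0)
corpus = ["short one", "a slightly longer sentence here",
          "tiny", "an even longer training sentence with many more words",
          "medium length sentence"]
# rank examples by difficulty (sentence length), then map rank -> CDF value
by_difficulty = sorted(range(len(corpus)), key=lambda i: len(corpus[i].split()))
cdf = {i: (r + 1) / len(corpus) for r, i in enumerate(by_difficulty)}

c0, T = 0.2, 1000
def competence(t):
    """Square-root competence schedule: c0 at t=0, reaching 1 at t=T."""
    return min(1.0, math.sqrt(t * (1 - c0**2) / T + c0**2))

def sample_batch(t, k=4):
    """Uniform sampling restricted to examples the model is competent for."""
    c = competence(t)
    eligible = [i for i in cdf if cdf[i] <= c]
    return [random.choice(eligible) for _ in range(k)]

early = sample_batch(0)       # only the easiest example is eligible
late = sample_batch(T)        # the whole corpus is eligible
```

Applying this to an existing NMT model amounts to wrapping its input pipeline with `sample_batch`, which matches the abstract's claim that only the data pipeline needs modifying.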

Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly
Bayesian Optimisation (BO) refers to a suite of techniques for global optimisation of expensive black-box functions, which use introspective Bayesian models of the function to efficiently find the optimum. While BO has been applied successfully in many applications, modern optimisation tasks usher in new challenges where conventional methods fail spectacularly. In this work, we present Dragonfly, an open-source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real-world settings; these include better methods for handling high-dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimising over structured combinatorial spaces such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimising over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimisation and demonstrate that when the above methods are integrated, they enable significant improvements in the performance of BO. The Dragonfly library is available at dragonfly.github.io.
03/15/2019 ∙ by Kirthevasan Kandasamy, et al.
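The core BO loop that libraries of this kind implement can be sketched generically: fit a GP surrogate to past evaluations, minimize a lower-confidence-bound acquisition over a candidate grid, evaluate the chosen point, repeat. This is not Dragonfly's API; the objective, kernel, lengthscale, and beta are illustration values.

```python
import numpy as np

def f(x):                                   # expensive black-box objective
    return np.sin(3 * x) + 0.5 * x**2

def rbf(a, b, ls=0.5):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ls**2))

def gp_posterior(Xobs, yobs, Xq, noise=1e-6):
    """Zero-mean GP posterior mean and stddev at query points Xq."""
    K = rbf(Xobs, Xobs) + noise * np.eye(len(Xobs))
    Ks = rbf(Xobs, Xq)
    alpha = np.linalg.solve(K, yobs)
    v = np.linalg.solve(K, Ks)
    mu = Ks.T @ alpha
    var = np.clip(1.0 - np.sum(Ks * v, axis=0), 1e-12, None)  # k(x,x)=1 for RBF
    return mu, np.sqrt(var)

rng = np.random.default_rng(2)
grid = np.linspace(-2.0, 2.0, 201)
Xobs = rng.uniform(-2.0, 2.0, size=3)       # a few random initial evaluations
yobs = f(Xobs)
for _ in range(12):
    mu, sd = gp_posterior(Xobs, yobs, grid)
    x_next = grid[np.argmin(mu - 2.0 * sd)]   # LCB acquisition (beta = 2)
    Xobs = np.append(Xobs, x_next)
    yobs = np.append(yobs, f(x_next))
best_x, best_y = Xobs[np.argmin(yobs)], yobs.min()
```

Dragonfly's contribution is everything this sketch leaves out: high-dimensional and combinatorial domains, multi-fidelity evaluations, parallelism, and automatic model/acquisition selection.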

Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels
While graph kernels (GKs) are easy to train and enjoy provable theoretical guarantees, their practical performance is limited by their expressive power, as the kernel function often depends on hand-crafted combinatorial features of graphs. Compared to graph kernels, graph neural networks (GNNs) usually achieve better practical performance, as GNNs use multi-layer architectures and nonlinear activation functions to extract high-order information of graphs as features. However, due to the large number of hyperparameters and the nonconvex nature of the training procedure, GNNs are harder to train. Theoretical guarantees for GNNs are also not well understood. Furthermore, the expressive power of GNNs scales with the number of parameters, and thus it is hard to exploit the full power of GNNs when computing resources are limited. The current paper presents a new class of graph kernels, Graph Neural Tangent Kernels (GNTKs), which correspond to infinitely wide multi-layer GNNs trained by gradient descent. GNTKs enjoy the full expressive power of GNNs and inherit the advantages of GKs. Theoretically, we show GNTKs provably learn a class of smooth functions on graphs. Empirically, we test GNTKs on graph classification datasets and show they achieve strong performance.
05/30/2019 ∙ by Simon S. Du, et al.
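A single GNTK-style block can be sketched as follows: neighbourhood aggregation updates a node-pair covariance, a ReLU layer updates it through the arc-cosine kernel formulas of the NTK recursion, and the graph-level kernel is read out by sum pooling. This is a heavily simplified one-block sketch, not the paper's full multi-block recursion; the random graphs and feature dimension are illustration values.

```python
import numpy as np

def relu_step(S, d1, d2):
    """Arc-cosine NTK update for covariance S with diagonals d1, d2."""
    denom = np.sqrt(np.outer(d1, d2))
    lam = np.clip(S / denom, -1.0, 1.0)
    Sdot = (np.pi - np.arccos(lam)) / (2 * np.pi)
    Snew = denom * (np.sqrt(1 - lam**2) + (np.pi - np.arccos(lam)) * lam) / (2 * np.pi)
    return Snew, Sdot

def gntk(A1, X1, A2, X2):
    # covariance of node features between (and within) the two graphs
    S12, S11, S22 = X1 @ X2.T, X1 @ X1.T, X2 @ X2.T
    # sum aggregation over neighbourhoods (adjacency includes self-loops)
    S12, S11, S22 = A1 @ S12 @ A2.T, A1 @ S11 @ A1.T, A2 @ S22 @ A2.T
    theta = S12.copy()
    d1, d2 = np.diag(S11), np.diag(S22)
    S12, Sdot = relu_step(S12, d1, d2)
    theta = theta * Sdot + S12               # NTK recursion, one ReLU layer
    return theta.sum()                       # sum-pooling readout

rng = np.random.default_rng(3)
graphs = []
for _ in range(6):
    n = int(rng.integers(3, 6))
    A = np.triu(rng.random((n, n)) < 0.5, 1)
    A = (A + A.T).astype(float) + np.eye(n)        # symmetric, self-loops
    X = rng.random((n, 4)) + 0.1                   # positive node features
    graphs.append((A, X))
K = np.array([[gntk(*g, *h) for h in graphs] for g in graphs])
```

Like any valid kernel, the resulting Gram matrix is symmetric and positive semi-definite, so it can be fed directly to a kernel method, which is how a GNTK replaces GNN training.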

Cautious Deep Learning
Most classifiers operate by selecting the maximum of an estimate of the conditional distribution p(y|x), where x stands for the features of the instance to be classified and y denotes its label. This often results in a hubristic bias: overconfidence in the assignment of a definite label. Usually the observations are concentrated in a small volume, yet the classifier provides definite predictions over the entire space. We propose constructing conformal prediction sets [vovk2005algorithmic], which contain a set of labels rather than a single label. These conformal prediction sets contain the true label with probability 1-α. Our construction is based on p(x|y) rather than p(y|x), which results in a classifier that is very cautious: it outputs the null set (meaning 'I don't know') when the object does not resemble the training examples. An important property of our approach is that classes can be added or removed without having to retrain the classifier. We demonstrate the performance on the ImageNet ILSVRC dataset using high-dimensional features obtained from state-of-the-art convolutional neural networks.
05/24/2018 ∙ by Yotam Hechtlinger, et al.
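The p(x|y)-based construction can be sketched on 1-D toy data: estimate p(x|y) per class (here a per-class Gaussian fit rather than the paper's deep features), set a per-class threshold at roughly the α-quantile of the training densities, and output every label whose density at x clears its threshold. A point far from all classes then gets the empty set ("I don't know"), and adding a class only requires fitting that class's density.

```python
import math, random

random.seed(0)
alpha = 0.1
classes = {0: (0.0, 1.0), 1: (4.0, 1.0)}            # true class means / sds
train = {y: [random.gauss(mu, sd) for _ in range(200)]
         for y, (mu, sd) in classes.items()}

def gauss_pdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

fits, thresholds = {}, {}
for y, xs in train.items():
    mu = sum(xs) / len(xs)
    sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    fits[y] = (mu, sd)
    dens = sorted(gauss_pdf(x, mu, sd) for x in xs)
    thresholds[y] = dens[int(alpha * len(dens))]    # ~alpha-quantile of densities

def prediction_set(x):
    """All labels y whose class density at x clears that class's threshold."""
    return {y for y in fits if gauss_pdf(x, *fits[y]) >= thresholds[y]}

# fresh test points: coverage should be roughly 1 - alpha
test_pts = [(y, random.gauss(mu, sd)) for y, (mu, sd) in classes.items()
            for _ in range(200)]
coverage = sum(y in prediction_set(x) for y, x in test_pts) / len(test_pts)
outlier_set = prediction_set(100.0)                 # nothing like the data
```

Because each class is calibrated independently, removing class 1 (or adding a class 2) only touches that class's entry in `fits` and `thresholds`; no retraining of the others is needed.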

Estimating Cosmological Parameters from the Dark Matter Distribution
A grand challenge of 21st-century cosmology is to accurately estimate the cosmological parameters of our Universe. A major approach to estimating the cosmological parameters is to use the large-scale matter distribution of the Universe. Galaxy surveys provide the means to map out cosmic large-scale structure in three dimensions. Information about galaxy locations is typically summarized in a "single" function of scale, such as the galaxy correlation function or power spectrum. We show that it is possible to estimate these cosmological parameters directly from the distribution of matter. This paper presents the application of deep 3D convolutional networks to volumetric representations of dark-matter simulations, as well as results obtained using a recently proposed distribution regression framework, showing that machine learning techniques are comparable to, and can sometimes outperform, maximum-likelihood point estimates using "cosmological models". This opens the way to estimating the parameters of our Universe with higher accuracy.
11/06/2017 ∙ by Siamak Ravanbakhsh, et al.
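The input pipeline for such a model can be sketched in a few lines: particle positions from a (here, fake, uniformly random) simulation are voxelized into a 3-D density grid with a histogram, and a single hand-rolled 3-D convolution extracts a local feature map; real models stack many learned layers on top. Box size, grid resolution, and the kernel are illustration values.

```python
import numpy as np

rng = np.random.default_rng(4)
particles = rng.uniform(0.0, 1.0, size=(5000, 3))          # fake positions
G = 16
density, _ = np.histogramdd(particles, bins=(G, G, G),
                            range=[(0, 1)] * 3)            # voxel counts

def conv3d_valid(vol, kern):
    """Naive 'valid' 3-D convolution (correlation) of a volume with a cube kernel."""
    k = kern.shape[0]
    out = np.zeros(tuple(s - k + 1 for s in vol.shape))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                out[i, j, l] = np.sum(vol[i:i+k, j:j+k, l:l+k] * kern)
    return out

kern = np.ones((3, 3, 3)) / 27.0           # local mean-density filter
feat = np.maximum(conv3d_valid(density, kern), 0.0)        # ReLU feature map
```

The point of the voxel representation is that the full spatial distribution, not a single summary function of scale, is what the network sees.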

A Generic Approach for Escaping Saddle Points
A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive with the state of the art. Empirical results suggest that our strategy also enjoys good practical performance.
09/05/2017 ∙ by Sashank J Reddi, et al.
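The alternation can be sketched on a toy strict-saddle function: run plain gradient descent, and only when the gradient is tiny invoke the second-order subroutine, which here computes a (finite-difference) Hessian and steps along its most negative curvature direction before resuming gradient descent. The function, step sizes, and tolerances are illustration values, not the paper's algorithm.

```python
import numpy as np

def f(w):
    x, y = w
    return x**4 - 2 * x**2 + y**2          # strict saddle at the origin

def grad(w):
    x, y = w
    return np.array([4 * x**3 - 4 * x, 2 * y])

def hess(w, eps=1e-5):
    """Finite-difference Hessian (used only near critical points)."""
    H = np.zeros((2, 2))
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        H[:, i] = (grad(w + e) - grad(w - e)) / (2 * eps)
    return H

w = np.zeros(2)                            # start exactly at the saddle
for _ in range(300):
    g = grad(w)
    if np.linalg.norm(g) < 1e-6:           # near a critical point:
        lam, V = np.linalg.eigh(hess(w))   # second-order subroutine
        if lam[0] < -1e-6:                 # negative curvature found
            d = V[:, 0]
            w = min((w + 0.1 * d, w - 0.1 * d), key=f)   # escape the saddle
            continue
        break                              # second-order critical point
    w = w - 0.05 * g                       # cheap first-order step
```

Plain gradient descent never leaves the origin here; with the second-order subroutine invoked exactly once, the iterate escapes and converges to a global minimizer (x = ±1, f = −1).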

On the Reconstruction Risk of Convolutional Sparse Dictionary Learning
Sparse dictionary learning (SDL) has become a popular method for adaptively identifying parsimonious representations of a dataset, a fundamental problem in machine learning and signal processing. While most work on SDL assumes a training dataset of independent and identically distributed samples, a variant known as convolutional sparse dictionary learning (CSDL) relaxes this assumption, allowing more general sequential data sources, such as time series or other dependent data. Although recent work has explored the statistical properties of classical SDL, the statistical properties of CSDL remain unstudied. This paper begins such a study by identifying the minimax convergence rate of CSDL in terms of reconstruction risk, by both upper-bounding the risk of an established CSDL estimator and proving a matching information-theoretic lower bound. Our results indicate that consistency in reconstruction risk is possible precisely in the 'ultra-sparse' setting, in which the sparsity (i.e., the number of feature occurrences) is o(N) in the length N of the training sequence. Notably, our results make very weak assumptions, allowing arbitrary dictionaries and dependent measurement noise. Finally, we verify our theoretical results with numerical experiments on synthetic data.
08/29/2017 ∙ by Shashank Singh, et al.
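The CSDL data model and the reconstruction risk can be sketched concretely: a long signal is an ultra-sparse sum of shifted copies of short dictionary filters, and an estimate x̂ is scored by ‖x − x̂‖²/N. The reconstruction below uses greedy convolutional matching pursuit as a stand-in estimator (not the paper's), and the dictionary, occurrence positions, and sparsity level are illustration values.

```python
import numpy as np

N = 60
filters = [np.array([0.5, 1.0, 0.5]), np.array([1.0, -1.0, 0.5])]
filters = [f / np.linalg.norm(f) for f in filters]      # unit-norm atoms
# ultra-sparse ground truth: 2 well-separated occurrences in a length-60 signal
x = np.zeros(N)
x[10:13] += 2.0 * filters[0]
x[40:43] -= 1.5 * filters[1]

def conv_matching_pursuit(x, filters, sparsity):
    """Greedily peel off the (filter, shift) with the largest correlation."""
    res, xhat = x.copy(), np.zeros_like(x)
    for _ in range(sparsity):
        best = max(((np.correlate(res, f, mode="valid"), f) for f in filters),
                   key=lambda cf: np.abs(cf[0]).max())
        corr, f = best
        s = int(np.abs(corr).argmax())          # best shift for that filter
        xhat[s:s + len(f)] += corr[s] * f
        res[s:s + len(f)] -= corr[s] * f
    return xhat

xhat = conv_matching_pursuit(x, filters, sparsity=2)
risk = np.sum((x - xhat) ** 2) / N              # reconstruction risk
```

With non-overlapping occurrences and unit-norm filters the greedy pursuit recovers the signal exactly, so the empirical reconstruction risk is (numerically) zero; the paper's question is how fast this risk can shrink as N grows under far weaker assumptions.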

Recurrent Estimation of Distributions
This paper presents the recurrent estimation of distributions (RED) for modeling real-valued data in a semiparametric fashion. RED models make two novel uses of recurrent neural networks (RNNs) for density estimation of general real-valued data. First, RNNs are used to transform input covariates into a latent space that better captures conditional dependencies in the inputs. Second, an RNN is used to compute the conditional distributions of the latent covariates. The resulting model is efficient to train, evaluate, and sample from, whilst producing normalized pdfs. The effectiveness of RED is shown via several real-world data experiments. Our results show that RED models achieve a lower held-out negative log-likelihood than other neural network approaches across multiple dataset sizes and dimensionalities. Further evidence of the efficacy of RED is provided by anomaly detection tasks, where we also observe better performance over alternative models.
05/30/2017 ∙ by Junier B. Oliva, et al.
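The second ingredient can be sketched directly: an RNN-style recurrence computes the parameters of each one-dimensional conditional, so the joint density factorizes by the chain rule p(x) = ∏_d p(x_d | x_<d) and is normalized by construction. The hand-rolled cell below uses untrained random weights and Gaussian conditionals purely as illustration choices; it is not the trained RED model.

```python
import numpy as np

rng = np.random.default_rng(5)
D, H = 4, 8                                # data dimension, hidden state size
Wh, Wx = 0.5 * rng.normal(size=(H, H)), 0.5 * rng.normal(size=(H, 1))
Wout, bout = 0.5 * rng.normal(size=(2, H)), np.zeros(2)

def cond_params(h):
    """Mean and stddev of the Gaussian conditional p(x_d | x_<d)."""
    mu, logvar = Wout @ h + bout
    return mu, np.exp(0.5 * logvar)

def log_density(x):
    """Exact normalized log p(x) via the chain rule over dimensions."""
    h, prev, ll = np.zeros(H), 0.0, 0.0
    for d in range(D):
        h = np.tanh(Wh @ h + (Wx * prev).ravel())
        mu, sd = cond_params(h)
        ll += -0.5 * np.log(2 * np.pi * sd**2) - (x[d] - mu)**2 / (2 * sd**2)
        prev = x[d]
    return ll

def sample():
    """Ancestral sampling: draw each x_d from its conditional in turn."""
    h, prev, out = np.zeros(H), 0.0, []
    for _ in range(D):
        h = np.tanh(Wh @ h + (Wx * prev).ravel())
        mu, sd = cond_params(h)
        prev = mu + sd * rng.normal()
        out.append(prev)
    return np.array(out)

xs = sample()
ll = log_density(xs)
```

Because each conditional is an explicit Gaussian, the model produces normalized pdfs and supports exact likelihood evaluation and cheap ancestral sampling, the properties the abstract claims for RED.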