
Efficiently testing local optimality and escaping saddles for ReLU networks
We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of M data points on the nondifferentiability of the ReLU divides the parameter space into at most 2^M regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, O(M) (in)equality tests, and one (or a few) nonconvex QPs. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality-constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality-constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either the benign case or the bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.
09/28/2018 ∙ by Chulhee Yun, et al.
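A minimal sketch of the benign-case subroutine mentioned above, projected gradient descent on an equality-constrained QP, on a generic toy instance (this is not the paper's actual QP; the problem data and function name are illustrative):

```python
import numpy as np

# Illustrative sketch: projected gradient descent on an
# equality-constrained QP  min 0.5 x'Qx + b'x  s.t.  Ax = 0.
# Gradients are projected onto null(A), so iterates stay feasible.

def projected_gradient_qp(Q, b, A, steps=2000, lr=0.1):
    n = Q.shape[0]
    # Orthogonal projector onto null(A): P = I - A^+ A
    P = np.eye(n) - np.linalg.pinv(A) @ A
    x = np.zeros(n)           # feasible start, since A @ 0 = 0
    for _ in range(steps):
        g = Q @ x + b         # gradient of the quadratic objective
        x = x - lr * (P @ g)  # move only within the feasible subspace
    return x

Q = np.diag([2.0, 1.0, 3.0])
b = np.array([-1.0, -2.0, 0.5])
A = np.array([[1.0, 1.0, 1.0]])     # constraint: x1 + x2 + x3 = 0
x = projected_gradient_qp(Q, b, A)
print(np.allclose(A @ x, 0.0))      # prints True: iterate stays feasible
```

With step size 1/lr above the largest restricted eigenvalue, the error contracts geometrically, which is the "exponentially fast" behavior the abstract refers to in the convex benign case.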

Finite sample expressive power of small-width ReLU networks
We study universal finite sample expressivity of neural networks, defined as the capability to perfectly memorize arbitrary datasets. For scalar outputs, existing results require a hidden layer as wide as N to memorize N data points. In contrast, we prove that a 3-layer (2-hidden-layer) ReLU network with 4√(N) hidden nodes can perfectly fit any arbitrary dataset. For K-class classification, we prove that a 4-layer ReLU network with 4√(N) + 4K hidden neurons can memorize arbitrary datasets. For example, a 4-layer ReLU network with only 8,000 hidden nodes can memorize datasets with N = 1M and K = 1k (e.g., ImageNet). Our results show that even small networks already have tremendous overfitting capability, admitting zero empirical risk for any dataset. We also extend our results to deeper and narrower networks, and prove converse results showing necessity of Ω(N) parameters for shallow networks.
10/17/2018 ∙ by Chulhee Yun, et al.
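The abstract's ImageNet-scale arithmetic can be checked directly (the function name below is ours, not from the paper):

```python
import math

# Check the abstract's example: a 4-layer ReLU network with
# 4*sqrt(N) + 4K hidden neurons memorizes N points with K classes.
def hidden_neurons(N, K):
    # math.isqrt is exact here since N = 10^6 is a perfect square
    return 4 * math.isqrt(N) + 4 * K

print(hidden_neurons(10**6, 10**3))  # 4*1000 + 4*1000 = 8000
```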

Near Optimal Stratified Sampling
The performance of a machine learning system is usually evaluated by using i.i.d. observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a lower bound showing that the proposed algorithms are rate-optimal up to log factors. Experiments on synthetic and real data show the reduction in label complexity that is enabled by our algorithms.
06/26/2019 ∙ by Tiancheng Yu, et al.
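For contrast with the paper's setting, here is classical stratified (Neyman) allocation under the idealized assumption the abstract calls unrealistic, namely that per-stratum variances are known in advance; the paper's algorithms instead estimate these quantities while sampling. The numbers below are made up:

```python
import numpy as np

# Classical Neyman allocation with known strata statistics (the idealized
# setting; the paper's algorithms estimate these quantities online).
weights = np.array([0.5, 0.3, 0.2])  # stratum sizes as population fractions
sigmas  = np.array([0.1, 1.0, 2.0])  # per-stratum standard deviations
budget  = 300                        # total labeling budget

alloc = weights * sigmas             # sample stratum i in proportion to w_i * sigma_i
n_i = np.maximum(1, np.round(budget * alloc / alloc.sum()).astype(int))
print(n_i)  # most of the label budget goes to the high-variance strata
```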

Flexible Modeling of Diversity with Strongly Log-Concave Distributions
Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. We develop two fundamental tools needed to apply SLC distributions to learning and inference: sampling and mode finding. For sampling we develop an MCMC sampler and give theoretical mixing time bounds. For mode finding, we establish a weak log-submodularity property for SLC functions and derive optimization guarantees for a distorted greedy algorithm.
06/12/2019 ∙ by Joshua Robinson, et al.

Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation
A Markov Decision Process (MDP) is a popular model for reinforcement learning. However, its commonly used assumption of stationary dynamics and rewards is too stringent and fails to hold in adversarial, non-stationary, or multi-agent problems. We study an episodic setting where the parameters of an MDP can differ across episodes. We learn a reliable policy of this potentially adversarial MDP by developing an Adversarial Reinforcement Learning (ARL) algorithm that reduces our MDP to a sequence of adversarial bandit problems. ARL achieves O(√(SATH^3)) regret, which is optimal with respect to S, A, and T, and its dependence on H is the best (even for the usual stationary MDP) among existing model-free methods.
07/22/2019 ∙ by Tiancheng Yu, et al.

Are deep ResNets provably better than linear predictors?
Recently, a residual network (ResNet) with a single residual block has been shown to outperform linear predictors, in the sense that all its local minima are at least as good as the best linear predictor. We take a step towards extending this result to deep ResNets. As motivation, we first show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, while a ResNet can have strictly better local minima. Second, we show that even at its global minimum, the representation obtained from the residual blocks of a 2-block ResNet does not necessarily improve monotonically as more blocks are added, highlighting a fundamental difficulty in analyzing deep ResNets. Our main result on deep ResNets shows that (under some geometric conditions) any critical point is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Finally, we complement our results by analyzing near-identity regions of deep ResNets, obtaining size-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.
07/09/2019 ∙ by Chulhee Yun, et al.

A Generic Approach for Escaping Saddle Points
A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian-based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive with the state of the art. Empirical results suggest that our strategy also enjoys good practical performance.
09/05/2017 ∙ by Sashank J Reddi, et al.
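A hedged sketch of the alternation the abstract describes: cheap first-order steps, with a second-order escape step only near apparent stationary points. For clarity, this toy version eigendecomposes the full Hessian, which is exactly the cost the paper's framework is designed to minimize; the test function and all names are illustrative:

```python
import numpy as np

# Sketch: first-order steps by default; near a stationary point, check
# curvature and, if a negative eigenvalue exists, step along it to escape.
def escape_saddle_descent(grad, hess, x, lr=0.1, tol=1e-4, steps=500):
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) > tol:
            x = x - lr * g                 # cheap first-order step
        else:
            vals, vecs = np.linalg.eigh(hess(x))
            if vals[0] >= -tol:
                return x                   # approx. second-order stationary
            x = x + lr * vecs[:, 0]        # escape along negative curvature
    return x

# Toy function with a saddle at the origin: f(x, y) = x^2 + y^4/4 - y^2/2,
# whose minima sit at (0, +/-1).
f_grad = lambda p: np.array([2 * p[0], p[1] ** 3 - p[1]])
f_hess = lambda p: np.diag([2.0, 3 * p[1] ** 2 - 1.0])
x = escape_saddle_descent(f_grad, f_hess, np.array([0.0, 1e-9]))
print(x)  # ends near a minimum at y = +/-1 instead of stalling at the saddle
```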

Global optimality conditions for deep neural networks
We study the error landscape of deep linear and nonlinear neural networks with square error loss. We build on recent results in the literature and present necessary and sufficient conditions for a critical point of the empirical risk function to be a global minimum in the deep linear network case. Our simple conditions can also be used to determine whether a given critical point is a global minimum or a saddle point. We further extend these results to deep nonlinear neural networks and prove similar sufficient conditions for global optimality in the function space.
07/08/2017 ∙ by Chulhee Yun, et al.

Diversity Networks: Neural Network Compression Using Determinantal Point Processes
We introduce Divnet, a flexible technique for learning networks with diverse neurons. Divnet models neuronal diversity by placing a Determinantal Point Process (DPP) over neurons in a given layer. It uses this DPP to select a subset of diverse neurons and subsequently fuses the redundant neurons into the selected ones. Compared with previous approaches, Divnet offers a more principled, flexible technique for capturing neuronal diversity and thus implicitly enforcing regularization. This enables effective auto-tuning of network architecture and leads to smaller network sizes without hurting performance. Moreover, through its focus on diversity and neuron fusing, Divnet remains compatible with other procedures that seek to reduce memory footprints of networks. We present experimental results to corroborate our claims: for pruning neural networks, Divnet is seen to be notably superior to competing approaches.
11/16/2015 ∙ by Zelda Mariet, et al.
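Divnet samples neurons from a DPP; as a rough illustration of why a determinantal kernel favors diversity, the sketch below instead greedily maximizes the log-determinant of the kernel submatrix (a heuristic stand-in, not Divnet's sampling procedure; the weight vectors are made up):

```python
import numpy as np

# Greedy log-det selection over a neuron-similarity kernel L = W W^T.
# det(L_S) grows with the volume spanned by the selected neurons'
# weight vectors, so near-duplicate neurons are avoided.
def greedy_diverse_subset(L, k):
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for j in range(L.shape[0]):
            if j in selected:
                continue
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best, best_gain = j, logdet
        selected.append(best)
    return selected

# Neurons 0 and 1 are near-duplicates; neuron 2 is orthogonal to both.
W = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
L = W @ W.T + 1e-6 * np.eye(3)      # small ridge keeps L positive definite
print(greedy_diverse_subset(L, 2))  # [0, 2]: the two near-orthogonal neurons
```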

An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization
We consider maximum likelihood estimation for Gaussian Mixture Models (GMMs). This task is almost invariably solved (in theory and practice) via the Expectation Maximization (EM) algorithm. EM owes its success to various factors, of which its ability to fulfill positive-definiteness constraints in closed form is of key importance. We propose an alternative to EM by appealing to the rich Riemannian geometry of positive definite matrices, using which we cast GMM parameter estimation as a Riemannian optimization problem. Surprisingly, such an out-of-the-box Riemannian formulation completely fails and proves much inferior to EM. This motivates us to take a closer look at the problem geometry, and derive a better formulation that is much more amenable to Riemannian optimization. We then develop (Riemannian) batch and stochastic gradient algorithms that outperform EM, often substantially. We provide a non-asymptotic convergence analysis for our stochastic method, which is also the first (to our knowledge) such global analysis for Riemannian stochastic gradient. Numerous empirical results are included to demonstrate the effectiveness of our methods.
06/10/2017 ∙ by Reshad Hosseini, et al.

Polynomial Time Algorithms for Dual Volume Sampling
We study dual volume sampling, a method for selecting k columns from an n x m short and wide matrix (n <= k <= m) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix. This method was proposed by Avron and Boutsidis (2013), who showed it to be a promising method for column subset selection and its multiple applications. However, its wider adoption has been hampered by the lack of polynomial time sampling algorithms. We remove this hindrance by developing an exact (randomized) polynomial time sampling algorithm as well as its derandomization. Thereafter, we study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the "Strong Rayleigh" property. This result has numerous consequences, including a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. This sampler is closely related to classical algorithms for popular experimental design methods that to date lack theoretical analysis but are known to work well empirically.
03/08/2017 ∙ by Chengtao Li, et al.
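For a tiny matrix, the dual volume sampling distribution can be written down by brute force, which makes the paper's contribution concrete: the number of subsets grows combinatorially, and the paper's algorithms sample in polynomial time instead. The example matrix below is ours:

```python
import numpy as np
from itertools import combinations

# Brute-force the dual volume sampling distribution: for an n x m matrix A
# and a size-k column subset S, P(S) is proportional to det(A_S A_S^T),
# the squared volume spanned by the rows of the induced submatrix A_S.
def dvs_distribution(A, k):
    m = A.shape[1]
    subsets = list(combinations(range(m), k))
    vols = np.array([np.linalg.det(A[:, S] @ A[:, S].T) for S in subsets])
    return subsets, vols / vols.sum()

A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, 0.0]])
subsets, probs = dvs_distribution(A, 2)
for S, p in zip(subsets, probs):
    print(S, round(p, 3))  # collinear column pairs get probability 0
```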
Suvrit Sra
Researcher at Massachusetts Institute of Technology (MIT); Co-founder and Chief AI Officer at macroeyes