Suvrit Sra

is this you? claim profile


Researcher at Massachusetts Institute of Technology (MIT), Cofounder; Chief AI Officer at macro-eyes

  • Efficiently testing local optimality and escaping saddles for ReLU networks

    We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of M data points on the nondifferentiability of the ReLU divides the parameter space into at most 2^M regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, O(M) (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either benign case or bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.

    09/28/2018 ∙ by Chulhee Yun, et al. ∙ 18 share

    read it

  • Finite sample expressive power of small-width ReLU networks

    We study universal finite sample expressivity of neural networks, defined as the capability to perfectly memorize arbitrary datasets. For scalar outputs, existing results require a hidden layer as wide as N to memorize N data points. In contrast, we prove that a 3-layer (2-hidden-layer) ReLU network with 4 √(N) hidden nodes can perfectly fit any arbitrary dataset. For K-class classification, we prove that a 4-layer ReLU network with 4 √(N) + 4K hidden neurons can memorize arbitrary datasets. For example, a 4-layer ReLU network with only 8,000 hidden nodes can memorize datasets with N = 1M and K = 1k (e.g., ImageNet). Our results show that even small networks already have tremendous overfitting capability, admitting zero empirical risk for any dataset. We also extend our results to deeper and narrower networks, and prove converse results showing necessity of Ω(N) parameters for shallow networks.

    10/17/2018 ∙ by Chulhee Yun, et al. ∙ 12 share

    read it

  • Near Optimal Stratified Sampling

    The performance of a machine learning system is usually evaluated by using i.i.d. observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a lower bound to show the proposed algorithms (to log-factors) are rate optimal. Experiments on synthetic and real data show the reduction in label complexity that is enabled by our algorithms.

    06/26/2019 ∙ by Tiancheng Yu, et al. ∙ 6 share

    read it

  • Flexible Modeling of Diversity with Strongly Log-Concave Distributions

    Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. We develop two fundamental tools needed to apply SLC distributions to learning and inference: sampling and mode finding. For sampling we develop an MCMC sampler and give theoretical mixing time bounds. For mode finding, we establish a weak log-submodularity property for SLC functions and derive optimization guarantees for a distorted greedy algorithm.

    06/12/2019 ∙ by Joshua Robinson, et al. ∙ 4 share

    read it

  • Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation

    A Markov Decision Process (MDP) is a popular model for reinforcement learning. However, its commonly used assumption of stationary dynamics and rewards is too stringent and fails to hold in adversarial, nonstationary, or multi-agent problems. We study an episodic setting where the parameters of an MDP can differ across episodes. We learn a reliable policy of this potentially adversarial MDP by developing an Adversarial Reinforcement Learning (ARL) algorithm that reduces our MDP to a sequence of adversarial bandit problems. ARL achieves O(√(SATH^3)) regret, which is optimal with respect to S, A, and T, and its dependence on H is the best (even for the usual stationary MDP) among existing model-free methods.

    07/22/2019 ∙ by Tiancheng Yu, et al. ∙ 4 share

    read it

  • Are deep ResNets provably better than linear predictors?

    Recently, a residual network (ResNet) with a single residual block has been shown to outperform linear predictors, in the sense that all its local minima are at least as good as the best linear predictor. We take a step towards extending this result to deep ResNets. As motivation, we first show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, while a ResNet can have strictly better local minima. Second, we show that even at its global minimum, the representation obtained from the residual blocks of a 2-block ResNet does not necessarily improve monotonically as more blocks are added, highlighting a fundamental difficulty in analyzing deep ResNets. Our main result on deep ResNets shows that (under some geometric conditions) any critical point is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Finally, we complement our results by analyzing near-identity regions of deep ResNets, obtaining size-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.

    07/09/2019 ∙ by Chulhee Yun, et al. ∙ 2 share

    read it

  • A Generic Approach for Escaping Saddle points

    A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive to the state-of-the-art. Empirical results suggest that our strategy also enjoys a good practical performance.

    09/05/2017 ∙ by Sashank J Reddi, et al. ∙ 0 share

    read it

  • Global optimality conditions for deep neural networks

    We study the error landscape of deep linear and nonlinear neural networks with square error loss. We build on the recent results in the literature and present necessary and sufficient conditions for a critical point of the empirical risk function to be a global minimum in the deep linear network case. Our simple conditions can also be used to determine whether a given critical point is a global minimum or a saddle point. We further extend these results to deep nonlinear neural networks and prove similar sufficient conditions for global optimality in the function space.

    07/08/2017 ∙ by Chulhee Yun, et al. ∙ 0 share

    read it

  • Diversity Networks: Neural Network Compression Using Determinantal Point Processes

    We introduce Divnet, a flexible technique for learning networks with diverse neurons. Divnet models neuronal diversity by placing a Determinantal Point Process (DPP) over neurons in a given layer. It uses this DPP to select a subset of diverse neurons and subsequently fuses the redundant neurons into the selected ones. Compared with previous approaches, Divnet offers a more principled, flexible technique for capturing neuronal diversity and thus implicitly enforcing regularization. This enables effective auto-tuning of network architecture and leads to smaller network sizes without hurting performance. Moreover, through its focus on diversity and neuron fusing, Divnet remains compatible with other procedures that seek to reduce memory footprints of networks. We present experimental results to corroborate our claims: for pruning neural networks, Divnet is seen to be notably superior to competing approaches.

    11/16/2015 ∙ by Zelda Mariet, et al. ∙ 0 share

    read it

  • An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization

    We consider maximum likelihood estimation for Gaussian Mixture Models (Gmms). This task is almost invariably solved (in theory and practice) via the Expectation Maximization (EM) algorithm. EM owes its success to various factors, of which is its ability to fulfill positive definiteness constraints in closed form is of key importance. We propose an alternative to EM by appealing to the rich Riemannian geometry of positive definite matrices, using which we cast Gmm parameter estimation as a Riemannian optimization problem. Surprisingly, such an out-of-the-box Riemannian formulation completely fails and proves much inferior to EM. This motivates us to take a closer look at the problem geometry, and derive a better formulation that is much more amenable to Riemannian optimization. We then develop (Riemannian) batch and stochastic gradient algorithms that outperform EM, often substantially. We provide a non-asymptotic convergence analysis for our stochastic method, which is also the first (to our knowledge) such global analysis for Riemannian stochastic gradient. Numerous empirical results are included to demonstrate the effectiveness of our methods.

    06/10/2017 ∙ by Reshad Hosseini, et al. ∙ 0 share

    read it

  • Polynomial Time Algorithms for Dual Volume Sampling

    We study dual volume sampling, a method for selecting k columns from an n x m short and wide matrix (n <= k <= m) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix. This method was proposed by Avron and Boutsidis (2013), who showed it to be a promising method for column subset selection and its multiple applications. However, its wider adoption has been hampered by the lack of polynomial time sampling algorithms. We remove this hindrance by developing an exact (randomized) polynomial time sampling algorithm as well as its derandomization. Thereafter, we study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the "Strong Rayleigh" property. This result has numerous consequences, including a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. This sampler is closely related to classical algorithms for popular experimental design methods that are to date lacking theoretical analysis but are known to empirically work well.

    03/08/2017 ∙ by Chengtao Li, et al. ∙ 0 share

    read it