
Stochastic Gradient Descent Escapes Saddle Points Efficiently
This paper considers the perturbed stochastic gradient descent algorithm and shows that it finds ϵ-second-order stationary points (∥∇ f(x)∥ ≤ ϵ and ∇^2 f(x) ≽ −√(ϵ) I) in Õ(d/ϵ^4) iterations, giving the first result with linear dependence on dimension for this setting. For the special case where stochastic gradients are Lipschitz, the dependence on dimension reduces to polylogarithmic. Beyond these new results, the paper also presents a simplified proof strategy that yields a shorter and more elegant proof of previously known results (Jin et al. 2017) on the perturbed gradient descent algorithm.
02/13/2019 ∙ by Chi Jin, et al.
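A minimal sketch of the perturbed-SGD idea described above. The toy saddle function, the hyperparameters, and the ball-sampled perturbation are our own illustrative choices, not the paper's algorithm or constants:

```python
import numpy as np

def perturbed_sgd(stoch_grad, x0, lr=0.01, radius=0.1, n_iters=500, rng=None):
    """Sketch of perturbed SGD: every update adds a small isotropic
    perturbation (sampled uniformly from a ball) to the stochastic
    gradient, which lets the iterate drift off saddle points along
    escape directions instead of stalling where the gradient vanishes."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        xi = rng.normal(size=x.shape)
        xi *= radius * rng.random() ** (1.0 / x.size) / np.linalg.norm(xi)
        x = x - lr * (stoch_grad(x, rng) + xi)
    return x

# f(x, y) = x^2 - y^2 has a strict saddle at the origin: the x-direction
# contracts while the y-direction is an escape direction.
def stoch_grad(z, rng):
    return np.array([2 * z[0], -2 * z[1]]) + 0.01 * rng.normal(size=2)

x_final = perturbed_sgd(stoch_grad, np.zeros(2))  # start exactly at the saddle
```

Starting exactly at the saddle, the injected noise seeds the unstable y-direction, which then grows geometrically, while the stable x-direction stays near zero.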

A Short Note on Concentration Inequalities for Random Vectors with Sub-Gaussian Norm
In this note, we derive concentration inequalities for random vectors with sub-Gaussian norm (a generalization of both sub-Gaussian random vectors and norm-bounded random vectors), which are tight up to logarithmic factors.
02/11/2019 ∙ by Chi Jin, et al.
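A quick Monte Carlo illustration of the phenomenon, using the simplest example of a vector with sub-Gaussian norm: a standard Gaussian in R^d, whose norm concentrates around √d with O(1) fluctuations (the dimensions and thresholds are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 5000
# n independent standard Gaussian vectors in R^d.
norms = np.linalg.norm(rng.normal(size=(n, d)), axis=1)
# The norm concentrates around sqrt(d): deviations are O(1), not O(sqrt(d)).
deviation = np.abs(norms - np.sqrt(d))
frac_within_3 = np.mean(deviation <= 3.0)
```

Despite the typical norm being about 10, almost all samples fall within a constant-width band around √d, the qualitative behavior a sub-Gaussian-norm tail bound captures.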

The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure
There is a stark disparity between the step size schedules used in practical large-scale machine learning and those that are considered optimal by the theory of stochastic approximation. In theory, most results utilize polynomially decaying learning rate schedules, while, in practice, the "Step Decay" schedule is among the most popular schedules, where the learning rate is cut every constant number of epochs (i.e. this is a geometrically decaying schedule). This work examines the step-decay schedule for the stochastic optimization problem of streaming least squares regression (both in the non-strongly convex and strongly convex case), where we show that a sharp theoretical characterization of an optimal learning rate schedule is far more nuanced than suggested by previous work. We focus specifically on the rate that is achievable when using the final iterate of stochastic gradient descent, as is commonly done in practice. Our main result provably shows that a properly tuned geometrically decaying learning rate schedule provides an exponential improvement (in terms of the condition number) over any polynomially decaying learning rate schedule. We also provide experimental support for wider applicability of these results, including for training modern deep neural networks.
04/29/2019 ∙ by Rong Ge, et al.
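A runnable sketch of the setting above: SGD on a streaming, ill-conditioned least-squares problem with a step-decay schedule, returning the final iterate. The problem instance (bounded Rademacher-style features, noise level, stage lengths) is a hypothetical toy, not the paper's construction:

```python
import numpy as np

def sgd_step_decay(n_steps=10000, lr0=0.5, decay=0.5, n_stages=5,
                   d=5, cond=50.0, rng=None):
    """Streaming least-squares SGD with a step-decay schedule: the learning
    rate is cut by `decay` at the end of each of `n_stages` equal-length
    stages, and the *final iterate* (no averaging) is returned."""
    rng = rng or np.random.default_rng(0)
    eigs = np.geomspace(1.0 / cond, 1.0, d)   # ill-conditioned feature covariance
    w_star = np.ones(d)
    w = np.zeros(d)
    stage_len = n_steps // n_stages
    lr = lr0
    for t in range(n_steps):
        # One fresh streaming sample: bounded random-sign features, noisy label.
        x = rng.choice([-1.0, 1.0], size=d) * np.sqrt(eigs)
        y = x @ w_star + 0.1 * rng.normal()
        w -= lr * (x @ w - y) * x             # single-sample gradient step
        if (t + 1) % stage_len == 0:
            lr *= decay                       # "step decay": cut the rate
    return w, w_star

w, w_star = sgd_step_decay()
err = np.linalg.norm(w - w_star)
```

The early high-rate stages drive the bias down quickly even in poorly conditioned directions, and the later low-rate stages shrink the noise floor of the final iterate.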

Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets
Mode connectivity is a surprising phenomenon in the loss landscape of deep nets. Optima, at least those discovered by gradient-based optimization, turn out to be connected by simple paths on which the loss function is almost constant. Often, these paths can be chosen to be piecewise linear, with as few as two segments. We give mathematical explanations for this phenomenon, assuming generic properties (such as dropout stability and noise stability) of well-trained deep nets, which have previously been identified as part of understanding the generalization properties of deep nets. Our explanation holds for realistic multilayer nets, and experiments are presented to verify the theory.
06/14/2019 ∙ by Rohith Kuditipudi, et al.
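The phenomenon above can be checked numerically by evaluating the loss along a path between two optima. Here is a deliberately degenerate toy (our own, far simpler than the paper's nets): an over-parameterized model with two redundant units, where distinct optima lie on a flat connected set and even a single straight segment has constant loss:

```python
import numpy as np

# Toy over-parameterized model: f(x) = (w1 + w2) * x. Any (w1, w2) with
# w1 + w2 = 1 fits the data exactly, so two distinct optima are connected
# by a path along which the loss stays (numerically) at zero.
def loss(w, X, y):
    return np.mean((X * (w[0] + w[1]) - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=50)
y = X.copy()                              # target function: multiply by 1

w_a = np.array([1.0, 0.0])                # one optimum
w_b = np.array([0.0, 1.0])                # another optimum
ts = np.linspace(0.0, 1.0, 101)
path_losses = [loss((1 - t) * w_a + t * w_b, X, y) for t in ts]
max_loss_on_path = max(path_losses)       # near zero along the whole segment
```

In realistic nets the straight segment between optima typically does cross a loss barrier; the paper's point is that properties like dropout stability still guarantee a low-loss piecewise-linear detour.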

Beyond Log-concavity: Provable Guarantees for Sampling Multimodal Distributions using Simulated Tempering Langevin Monte Carlo
A key task in Bayesian statistics is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). However, without any assumptions, sampling (even approximately) can be #P-hard, and few works have provided "beyond worst-case" guarantees for such settings. For log-concave distributions, classical results going back to Bakry and Émery (1985) show that natural continuous-time Markov chains called Langevin diffusions mix in polynomial time. The most salient feature of log-concavity violated in practice is unimodality: commonly, the distributions we wish to sample from are multimodal. In the presence of multiple deep and well-separated modes, Langevin diffusion suffers from torpid mixing. We address this problem by combining Langevin diffusion with simulated tempering. The result is a Markov chain that mixes more rapidly by transitioning between different temperatures of the distribution. We analyze this Markov chain for the canonical multimodal distribution: a mixture of Gaussians (of equal variance). The algorithm based on our Markov chain provably samples from distributions that are close to mixtures of Gaussians, given access to the gradient of the log-pdf. For the analysis, we use a spectral decomposition theorem for graphs (Gharan and Trevisan, 2014) and a Markov chain decomposition technique (Madras and Randall, 2002).
10/07/2017 ∙ by Rong Ge, et al.
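A rough 1-D sketch of the mechanism described above. This toy omits the per-level partition-function weights that the full simulated tempering algorithm uses (for a symmetric equal-weight mixture the omission is benign), and all parameters are illustrative:

```python
import numpy as np

def log_pdf(x):
    # Unnormalized log-density: equal mixture of N(-4, 1) and N(4, 1).
    return np.logaddexp(-0.5 * (x - 4) ** 2, -0.5 * (x + 4) ** 2)

def grad_log_pdf(x):
    p1 = np.exp(-0.5 * (x - 4) ** 2)
    p2 = np.exp(-0.5 * (x + 4) ** 2)
    return (-(x - 4) * p1 - (x + 4) * p2) / (p1 + p2)

def tempering_langevin(n_steps=20000, dt=0.05, temps=(1.0, 4.0), rng=None):
    """Discretized Langevin steps at the current temperature alternate with
    Metropolis-style proposals to switch temperature; the hot level flattens
    the barrier at x = 0 so the chain can cross between the two modes."""
    rng = rng or np.random.default_rng(0)
    x, level = 0.0, 0
    cold_samples = []
    for _ in range(n_steps):
        beta = 1.0 / temps[level]
        # Langevin step targeting pdf^beta.
        x += dt * beta * grad_log_pdf(x) + np.sqrt(2 * dt) * rng.normal()
        new = 1 - level                   # propose the other temperature
        log_acc = (1.0 / temps[new] - 1.0 / temps[level]) * log_pdf(x)
        if np.log(rng.random()) < log_acc:
            level = new
        if level == 0:
            cold_samples.append(x)        # keep samples at the target temperature
    return np.array(cold_samples)

samples = tempering_langevin()
frac_right_mode = np.mean(samples > 0)
```

Plain Langevin at temperature 1 started in one mode would essentially never cross the barrier in this many steps; with the hot level available, the chain visits both modes.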

On the Optimization Landscape of Tensor Decompositions
Nonconvex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why they can work for these NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that "all local optima are (approximately) global optima", and thus they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult. In this paper, we analyze the optimization landscape of the random overcomplete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a nonconvex objective. We show that for any small constant ϵ > 0, among the set of points with function values a (1+ϵ)-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best-known result only characterizes the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem. Our main technique uses the Kac-Rice formula and random matrix theory. To the best of our knowledge, this is the first time the Kac-Rice formula has been successfully applied to counting the number of local minima of a highly structured random polynomial with dependent coefficients.
06/18/2017 ∙ by Rong Ge, et al.
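A small sketch of the gradient ascent procedure the abstract refers to, on the objective f(x) = Σ_i ⟨a_i, x⟩³ over the unit sphere. For a runnable demo we use an undercomplete instance with near-orthogonal random components (the paper's regime is overcomplete); dimensions and step size are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 10
A = rng.normal(size=(n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # random unit components a_i

def objective_grad(x):
    c = A @ x                                   # inner products <a_i, x>
    return np.sum(c ** 3), 3 * (c ** 2) @ A     # f(x) and its gradient

x = rng.normal(size=d)
x /= np.linalg.norm(x)                          # random init on the sphere
for _ in range(500):
    _, g = objective_grad(x)
    x = x + 0.1 * g                             # gradient ascent step ...
    x /= np.linalg.norm(x)                      # ... projected back to the sphere

best_corr = np.max(np.abs(A @ x))               # correlation with nearest component
f_final, _ = objective_grad(x)
```

From a plain random initialization, the iterate's largest component coefficient is amplified at each step, and the ascent converges to (approximately) one of the true components.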

No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis
In this paper we develop a new framework that captures the common landscape underlying nonconvex low-rank matrix problems, including matrix sensing, matrix completion, and robust PCA. In particular, we show for all the above problems (including asymmetric cases): 1) all local minima are also globally optimal; 2) no high-order saddle points exist. These results explain why simple algorithms such as stochastic gradient descent converge globally and efficiently optimize these nonconvex objective functions in practice. Our framework connects and simplifies the existing analyses of optimization landscapes for matrix sensing and symmetric matrix completion, and it naturally leads to new results for asymmetric matrix completion and robust PCA.
04/03/2017 ∙ by Rong Ge, et al.
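A benign landscape of this kind can be observed directly on the simplest low-rank problem, factorized PSD matrix approximation: plain gradient descent on f(U) = ‖UU^T − M‖_F² from a small random initialization reaches a global optimum. The instance sizes and step size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 3
G = rng.normal(size=(d, r))
M = G @ G.T                              # ground-truth rank-r PSD matrix

def grad(U):
    return 4 * (U @ U.T - M) @ U         # gradient of ||U U^T - M||_F^2

U = 0.1 * rng.normal(size=(d, r))        # small random initialization
lr = 0.001
for _ in range(4000):
    U -= lr * grad(U)                    # plain gradient descent

rel_err = np.linalg.norm(U @ U.T - M) / np.linalg.norm(M)
```

The objective is nonconvex (U is only determined up to rotation), yet there are no spurious local minima here, so gradient descent recovers M to high accuracy.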

How to Escape Saddle Points Efficiently
This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations that depends only polylogarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are nondegenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the nonconvex optimization community.
03/02/2017 ∙ by Chi Jin, et al.
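A minimal sketch of the perturbation-when-stuck idea: take plain gradient steps, and whenever the gradient is tiny (a candidate saddle), jump to a random nearby point. The toy function, thresholds, and radius below are our own illustrative choices, not the paper's carefully tuned constants:

```python
import numpy as np

def grad(z):
    # f(x, y) = (x^2 - 1)^2 / 4 + y^2: strict saddle at (0, 0),
    # global minima at (+1, 0) and (-1, 0).
    x, y = z
    return np.array([x ** 3 - x, 2 * y])

def perturbed_gd(z0, lr=0.1, g_thresh=1e-3, radius=1e-2, n_iters=2000, rng=None):
    """Plain gradient descent, plus a small random perturbation (sampled
    from a sphere around the current point) whenever the gradient norm is
    below a threshold, i.e. near a stationary point."""
    rng = rng or np.random.default_rng(0)
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iters):
        g = grad(z)
        if np.linalg.norm(g) < g_thresh:
            u = rng.normal(size=z.shape)
            z = z + radius * u / np.linalg.norm(u)   # kick off the saddle
        else:
            z = z - lr * g
    return z

z_final = perturbed_gd(np.zeros(2))   # start exactly at the saddle (0, 0)
```

Unperturbed gradient descent started at (0, 0) would stay there forever; the perturbation seeds the negative-curvature x-direction, and the iterate then slides to one of the two minima.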

Generalization and Equilibrium in Generative Adversarial Nets (GANs)
We show that training of generative adversarial networks (GANs) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from the target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/generator game for a special class of generators with natural training objectives when generator capacity and training set sizes are moderate. This existence of equilibrium inspires the MIX+GAN protocol, which can be combined with any existing GAN training and is empirically shown to improve some of them.
03/02/2017 ∙ by Sanjeev Arora, et al.

Provable learning of Noisy-or Networks
Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (i.e., the coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding the parameters with maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models with linear structure: topic models, mixture models, hidden Markov models, etc. These algorithms use matrix or tensor decomposition and make some reasonable assumptions about the parameters of the underlying model. But matrix or tensor decomposition seems of little use when the latent variable model has nonlinearities. The current paper shows how to make progress: tensor decomposition is applied to learning the single-layer noisy-or network, which is a textbook example of a Bayes net and is used, for example, in the classic QMR-DT software for diagnosing which disease(s) a patient may have by observing the symptoms he/she exhibits. The technical novelty here, which should be useful in other settings in the future, is an analysis of tensor decomposition in the presence of systematic error (i.e., where the noise/error is correlated with the signal and doesn't decrease as the number of samples goes to infinity). This requires rethinking all steps of tensor decomposition methods from the ground up. For simplicity, our analysis is stated assuming that the network parameters were chosen from a probability distribution, but the method seems more generally applicable.
12/28/2016 ∙ by Sanjeev Arora, et al.
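For readers unfamiliar with the model: in a single-layer noisy-or network, each present disease independently fails to trigger symptom j with probability 1 − W[i, j], and the symptom fires unless every cause (and a small leak) fails. A forward-sampling sketch with made-up priors and weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_diseases, n_symptoms = 4, 6
prior = np.full(n_diseases, 0.2)                          # P(disease_i = 1)
W = rng.uniform(0.3, 0.9, size=(n_diseases, n_symptoms))  # activation probabilities
leak = 0.01                                               # spontaneous symptom rate

def sample(rng):
    """Forward-sample (diseases, symptoms) from the noisy-or model:
    P(symptom_j = 0 | d) = (1 - leak) * prod_{i: d_i = 1} (1 - W[i, j])."""
    d = rng.random(n_diseases) < prior
    p_off = (1 - leak) * np.prod(1 - W[d], axis=0)  # every cause fails
    s = rng.random(n_symptoms) >= p_off
    return d, s

samples = [sample(rng) for _ in range(20000)]
# Sanity check: empirical P(symptom 0) matches the closed form
# 1 - (1 - leak) * prod_i (1 - prior_i * W[i, 0]).
emp = np.mean([s[0] for _, s in samples])
exact = 1 - (1 - leak) * np.prod(1 - prior * W[:, 0])
```

The product form is exactly the nonlinearity that defeats vanilla matrix/tensor methods: marginal moments of the symptoms are not linear in the parameters W.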

Homotopy Analysis for Tensor PCA
Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g., convex) and progressively modify it to obtain the required objective, with the solutions passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the "high noise" regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the best degree-4 sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, viz., tensor power iteration, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.
10/28/2016 ∙ by Anima Anandkumar, et al.
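A sketch of the local-search endpoint the abstract mentions: tensor power iteration on a planted spiked tensor T = β v⊗v⊗v + noise. For a small runnable demo we use a strong signal and plain random restarts in place of the paper's homotopy-derived initialization and noise injection; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 30, 10.0                          # planted spiked-tensor instance
v = rng.normal(size=d)
v /= np.linalg.norm(v)                      # planted unit-norm spike
T = beta * np.einsum('i,j,k->ijk', v, v, v) \
    + rng.normal(size=(d, d, d)) / np.sqrt(d)

def power_iterate(x, n_iters=50):
    """Tensor power iteration: x <- T(x, x, .) / ||T(x, x, .)||."""
    for _ in range(n_iters):
        x = np.einsum('ijk,j,k->i', T, x, x)
        x /= np.linalg.norm(x)
    return x

best_val, best_x = -np.inf, None
for _ in range(15):                         # random restarts stand in for the
    x0 = rng.normal(size=d)                 # paper's homotopy initialization
    x = power_iterate(x0 / np.linalg.norm(x0))
    val = np.einsum('ijk,i,j,k->', T, x, x, x)  # objective T(x, x, x)
    if val > best_val:
        best_val, best_x = val, x

corr = abs(np.dot(best_x, v))               # alignment with the planted spike
```

Restarts whose initial correlation with v clears a threshold converge to the spike; keeping the restart with the largest objective value selects one of these and recovers v.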
Rong Ge
Assistant Professor at the Computer Science Department of Duke University.