
SGD without Replacement: Sharper Rates for General Smooth Convex Functions
We study stochastic gradient descent without replacement () for smooth convex functions. is widely observed to converge faster than true where each sample is drawn independently with replacement bottou2009curiously and hence, is more popular in practice. But it's convergence properties are not well understood as sampling without replacement leads to coupling between iterates and gradients. By using method of exchangeable pairs to bound Wasserstein distance, we provide the first nonasymptotic results for when applied to general smooth, stronglyconvex functions. In particular, we show that converges at a rate of O(1/K^2) while is known to converge at O(1/K) rate, where K denotes the number of passes over data and is required to be large enough. Existing results for in this setting require additional Hessian Lipschitz assumption gurbuzbalaban2015random,haochen2018random. For small K, we show can achieve same convergence rate as for general smooth stronglyconvex functions. Existing results in this setting require K=1 and hold only for generalized linear models shamir2016without. In addition, by careful analysis of the coupling, for both large and small K, we obtain better dependence on problem dependent parameters like condition number.
03/04/2019 ∙ by Prateek Jain, et al. ∙ 12 ∙ shareread it

Efficient Algorithms for Smooth Minimax Optimization
This paper studies first order methods for solving smooth minimax optimization problems _x _y g(x,y) where g(·,·) is smooth and g(x,·) is concave for each x. In terms of g(·,y), we consider two settings  strongly convex and nonconvex  and improve upon the best known rates in both. For stronglyconvex g(·, y), ∀ y, we propose a new algorithm combining MirrorProx and Nesterov's AGD, and show that it can find global optimum in Õ(1/k^2) iterations, improving over current stateoftheart rate of O(1/k). We use this result along with an inexact proximal point method to provide Õ(1/k^1/3) rate for finding stationary points in the nonconvex setting where g(·, y) can be nonconvex. This improves over current bestknown rate of O(1/k^1/5). Finally, we instantiate our result for finite nonconvex minimax problems, i.e., _x _1≤ i≤ m f_i(x), with nonconvex f_i(·), to obtain convergence rate of O(m( m)^3/2/k^1/3) total gradient evaluations for finding a stationary point.
07/02/2019 ∙ by Kiran Koshy Thekumparampil, et al. ∙ 10 ∙ shareread it

Leveraging Distributional Semantics for MultiLabel Learning
We present a novel and scalable label embedding framework for largescale multilabel learning a.k.a ExMLDS (Extreme MultiLabel Learning using Distributional Semantics). Our approach draws inspiration from ideas rooted in distributional semantics, specifically the Skip Gram Negative Sampling (SGNS) approach, widely used to learn word embeddings for natural language processing tasks. Learning such embeddings can be reduced to a certain matrix factorization. Our approach is novel in that it highlights interesting connections between label embedding methods used for multilabel learning and paragraph/document embedding methods commonly used for learning representations of text data. The framework can also be easily extended to incorporate auxiliary information such as labellabel correlations; this is crucial especially when there are a lot of missing labels in the training data. We demonstrate the effectiveness of our approach through an extensive set of experiments on a variety of benchmark datasets, and show that the proposed learning methods perform favorably compared to several baselines and stateoftheart methods for largescale multilabel learning. To facilitate endtoend learning, we develop a joint learning algorithm that can learn the embeddings as well as a regression model that predicts these embeddings given input features, via efficient gradientbased methods.
09/18/2017 ∙ by Rahul Wadbude, et al. ∙ 0 ∙ shareread it

A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)
This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model misspecification.
10/25/2017 ∙ by Prateek Jain, et al. ∙ 0 ∙ shareread it

Learning Mixture of Gaussians with Streaming Data
In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of N points in d dimensions generated by an unknown mixture of k spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers are Cσ distant with C=Ω((k k)^1/4σ) and where σ^2 is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to constants); our center separation requirement matches the best known result for spherical Gaussians vempalawang. For finite samples, we show that a bias term based on the initial estimate decreases at O(1/ poly(N)) rate while variance decreases at nearly optimal rate of σ^2 d/N. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers for which we provide an online PCA based clustering algorithm. Indeed, the asymptotic perstep time complexity of our algorithm is the optimal d· k while space complexity of our algorithm is O(dk k). In addition to the bias and variance terms which tend to 0, the hardthresholding based updates of streaming Lloyd's algorithm is agnostic to the data distribution and hence incurs an approximation error that cannot be avoided. However, by using a streaming version of the classical (softthresholdingbased) EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at nearly optimal rate, and tending to 0 for N→∞.
07/08/2017 ∙ by Aditi Raghunathan, et al. ∙ 0 ∙ shareread it

Recovery Guarantees for Onehiddenlayer Neural Networks
In this paper, we consider regression problems with onehiddenlayer neural networks (1NNs). We distill some properties of activation functions that lead to local strong convexity in the neighborhood of the groundtruth parameters for the 1NN squaredloss objective. Most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs, squared ReLUs and sigmoids. For activation functions that are also smooth, we show local linear convergence guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity d ·(1/ϵ) ·poly(k,λ ) and computational complexity n· d ·poly(k,λ) for smooth homogeneous activations with high probability, where d is the dimension of the input, k (k≤ d) is the number of hidden nodes, λ is a conditioning property of the groundtruth parameter matrix between the input layer and the hidden layer, ϵ is the targeted precision and n is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity linear in the input dimension and logarithmic in the precision.
06/10/2017 ∙ by Kai Zhong, et al. ∙ 0 ∙ shareread it

Accelerating Stochastic Gradient Descent
There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stochastic approximation for the least squares regression problem, and our main result refutes the conventional wisdom by showing that acceleration can be made robust to statistical errors. In particular, this work introduces an accelerated stochastic gradient method that provably achieves the minimax optimal statistical risk faster than stochastic gradient descent. Critical to the analysis is a sharp characterization of accelerated stochastic gradient descent as a stochastic process. We hope this characterization gives insights towards the broader question of designing simple and effective accelerated stochastic methods for more general convex and nonconvex optimization problems.
04/26/2017 ∙ by Prateek Jain, et al. ∙ 0 ∙ shareread it

Parallelizing Stochastic Approximation Through MiniBatching and TailAveraging
This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work sharply analyzes: (1) minibatching, a method of averaging many samples of the gradient to both reduce the variance of a stochastic gradient estimate and for parallelizing SGD and (2) tailaveraging, a method involving averaging the final few iterates of SGD in order to decrease the variance in SGD's final iterate. This work presents the first tight nonasymptotic generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problemdependent extent to which minibatching can be used to yield provable nearlinear parallelization speedups over SGD with batch size one. These results are utilized in providing a highly parallelizable SGD algorithm that obtains the optimal statistical error rate with nearly the same number of serial updates as batch gradient descent, which improves significantly over existing SGDstyle methods. Finally, this work sheds light on some fundamental differences in SGD's behavior when dealing with agnostic noise in the (nonrealizable) least squares regression problem. In particular, the work shows that the stepsizes that ensure optimal statistical error rates for the agnostic case must be a function of the noise properties. The central analysis tools used by this paper are obtained through generalizing the operator view of averaged SGD, introduced by Defossez and Bach (2015) followed by developing a novel analysis in bounding these operators to characterize the generalization error. These techniques may be of broader interest in analyzing various computational aspects of stochastic approximation.
10/12/2016 ∙ by Prateek Jain, et al. ∙ 0 ∙ shareread it

Efficient and Consistent Robust Time Series Analysis
We study the problem of robust time series analysis under the standard autoregressive (AR) time series model in the presence of arbitrary outliers. We devise an efficient hard thresholding based algorithm which can obtain a consistent estimate of the optimal AR model despite a large fraction of the time series points being corrupted. Our algorithm alternately estimates the corrupted set of points and the model parameters, and is inspired by recent advances in robust regression and hardthresholding methods. However, a direct application of existing techniques is hindered by a critical difference in the timeseries domain: each point is correlated with all previous points rendering existing tools inapplicable directly. We show how to overcome this hurdle using novel proof techniques. Using our techniques, we are also able to provide the first efficient and provably consistent estimator for the robust regression problem where a standard linear observation model with white additive noise is corrupted arbitrarily. We illustrate our methods on synthetic datasets and show that our methods indeed are able to consistently recover the optimal parameters despite a large fraction of points being corrupted.
07/01/2016 ∙ by Kush Bhatia, et al. ∙ 0 ∙ shareread it

Regret Bounds for Nondecomposable Metrics with Missing Labels
We consider the problem of recommending relevant labels (items) for a given data point (user). In particular, we are interested in the practically important setting where the evaluation is with respect to nondecomposable (over labels) performance metrics like the F_1 measure, and the training data has missing labels. To this end, we propose a generic framework that given a performance metric Ψ, can devise a regularized objective function and a threshold such that all the values in the predicted score vector above and only above the threshold are selected to be positive. We show that the regret or generalization error in the given metric Ψ is bounded ultimately by estimation error of certain underlying parameters. In particular, we derive regret bounds under three popular settings: a) collaborative filtering, b) multilabel classification, and c) PU (positiveunlabeled) learning. For each of the above problems, we can obtain precise nonasymptotic regret bound which is small even when a large fraction of labels is missing. Our empirical results on synthetic and benchmark datasets demonstrate that by explicitly modeling for missing labels and optimizing the desired performance metric, our algorithm indeed achieves significantly better performance (like F_1 score) when compared to methods that do not model missing label information carefully.
06/07/2016 ∙ by Prateek Jain, et al. ∙ 0 ∙ shareread it

Streaming PCA: Matching Matrix Bernstein and NearOptimal Finite Sample Guarantees for Oja's Algorithm
This work provides improved guarantees for streaming principle component analysis (PCA). Given A_1, ..., A_n∈R^d× d sampled independently from distributions satisfying E[A_i] = Σ for Σ≽0, this work provides an O(d)space lineartime singlepass streaming algorithm for estimating the top eigenvector of Σ. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes top eigenvector of the empirical covariance 1/n∑_i ∈ [n] A_i as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previous known sample complexities of streaming algorithms by either a multiplicative factor of O(d) or 1/gap where gap is the relative distance between the top two eigenvalues of Σ. These results are achieved through a novel analysis of the classic Oja's algorithm, one of the oldest and most popular algorithms for streaming PCA. In particular, this work shows that simply picking a random initial point w_0 and applying the update rule w_i + 1 = w_i + η_i A_i w_i suffices to accurately estimate the top eigenvector, with a suitable choice of η_i. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA.
02/22/2016 ∙ by Prateek Jain, et al. ∙ 0 ∙ shareread it