
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
We empirically demonstrate that full-batch gradient descent on neural ne...

When Is Generalizable Reinforcement Learning Tractable?
Agents trained by reinforcement learning (RL) often fail to generalize b...

Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
We formally study how Ensemble of deep learning models can improve test ...

A law of robustness for two-layers neural networks
We initiate the study of the inherent tradeoffs between the size of a ne...

Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK
We consider the dynamic of gradient descent for learning a two-layer neu...

Feature Purification: How Adversarial Training Performs Robust Deep Learning
Despite the great empirical success of adversarial training to defend de...

When can Wasserstein GANs minimize Wasserstein Distance?
Generative Adversarial Networks (GANs) are widely used models to learn c...

Backward Feature Correction: How Deep Learning Performs Deep Learning
How does a 110-layer ResNet learn a high-complexity classifier using rel...

Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks
Stochastic gradient descent with a large initial learning rate is a wide...

Complexity of Highly Parallel Non-Smooth Convex Optimization
A landmark result of non-smooth convex optimization is that gradient des...

What Can ResNet Learn Efficiently, Going Beyond Kernels?
How can neural networks such as ResNet efficiently learn CIFAR-10 with t...

Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without
We consider the non-stochastic version of the (cooperative) multi-player...

Can SGD Learn Recurrent Neural Networks with Provable Generalization?
Recurrent Neural Networks (RNNs) are among the most popular models in se...

Improved Path-length Regret Bounds for Bandits
We study adaptive regret bounds in terms of the variation of the losses ...

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
Neural networks have great success in many machine learning applications...

A Convergence Theory for Deep Learning via Over-Parameterization
Deep neural networks (DNNs) have demonstrated dominating performance in ...

Chasing Nested Convex Bodies Nearly Optimally
The convex body chasing problem, introduced by Friedman and Linial, is a...

Competitively Chasing Convex Bodies
Let F be a family of sets in some metric space. In the F-chasing problem...

On the Convergence Rate of Training Recurrent Neural Networks
Despite the huge success of deep learning, our understanding of how the ...

Statistical Convergence of the EM Algorithm on Gaussian Mixture Models
We study the convergence behavior of the Expectation Maximization (EM) a...

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
Neural networks have many successful applications, while much less theor...

Algorithmic Framework for Model-based Reinforcement Learning with Theoretical Guarantees
While model-based reinforcement learning has empirically been shown to s...

The Well Tempered Lasso
We study the complexity of the entire regularization path for least squa...

Online Improper Learning with an Approximation Oracle
We revisit the question of reducing online learning to approximate optim...

Operator Scaling via Geodesically Convex Optimization, Invariant Theory and Polynomial Identity Testing
We propose a new second-order method for geodesically convex optimizatio...

Learning Mixtures of Linear Regressions with Nearly Optimal Complexity
Mixtures of Linear Regressions (MLR) is an important mixture model with ...

An Alternative View: When Does SGD Escape Local Minima?
Stochastic gradient descent (SGD) is widely used in machine learning. Al...

Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits
Regret bounds in online learning compare the player's performance to L^*...

Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations
We show that the (stochastic) gradient descent algorithm provides an imp...

Algorithmic Regularization in Over-parameterized Matrix Recovery
We study the problem of recovering a low-rank matrix X^* from linear meas...

Neon2: Finding Local Minima via First-Order Oracles
We propose a reduction for non-convex optimization that can (1) turn a s...

Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach
The experimental design problem concerns the selection of k points from ...

An homotopy method for ℓ_p regression provably beyond self-concordance and in input-sparsity time
We consider the problem of linear regression where the ℓ_2^n norm loss (...

Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls
We propose a rank-k variant of the classical Frank-Wolfe algorithm to so...

A Nearly Instance Optimal Algorithm for Top-k Ranking under the Multinomial Logit Model
We study the active learning problem of top-k ranking from multi-wise co...

Provable Alternating Gradient Descent for Non-negative Matrix Factorization with Strong Correlations
Non-negative matrix factorization is a basic tool for decomposing data i...

Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU
The online problem of computing the top eigenvector is fundamental to ma...

Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates
Non-negative matrix factorization is a popular tool for decomposing data...

Faster Principal Component Regression and Stable Matrix Chebyshev Approximation
We solve principal component regression (PCR), up to a multiplicative ac...

First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate
We study streaming principal component analysis (PCA), that is to find, ...

Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition
We study k-GenEV, the problem of finding the top k generalized eigenvect...

LazySVD: Even Faster SVD Decomposition Yet Without Agonizing Pain
We study k-SVD that is to obtain the first k singular vectors of a matri...

Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods
The well known maximum-entropy principle due to Jaynes, which states tha...

Recovery guarantee of weighted low-rank approximation via alternating minimization
Many applications require recovering a ground truth low-rank matrix from...

Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Word embeddings are ubiquitous in NLP and information retrieval, but it'...

RANDWALK: A Latent Variable Model Approach to Word Embeddings
Semantic word embeddings represent the meaning of a word via a vector, a...

A Theoretical Analysis of NDCG Type Ranking Measures
A central problem in ranking is to design a ranking measure for evaluati...