
Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning
Federated learning is a challenging optimization problem due to the hete...
O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Transformer networks use pairwise attention to compute contextual embedd...
Why distillation helps: a statistical perspective
Knowledge distillation is a technique for improving the performance of a...
Doubly-stochastic mining for heterogeneous retrieval
Modern retrieval problems are characterised by training sets with potent...
Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...
Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the...
Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm ...
SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning
Federated learning is a key scenario in modern large-scale machine learn...
AdaCliP: Adaptive Clipping for Private SGD
Privacy preserving machine learning algorithms are crucial for learning ...
On the Convergence of Adam and Beyond
Several recently proposed stochastic optimization methods that have been...
Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learni...
Stochastic Negative Mining for Learning with Large Output Spaces
We consider the problem of retrieving the most relevant labels for a giv...
A Generic Approach for Escaping Saddle points
A central challenge to using first-order methods for optimizing nonconve...
AIDE: Fast and Communication Efficient Distributed Optimization
In this paper, we present two new communication-efficient methods for di...
Stochastic Frank-Wolfe Methods for Nonconvex Optimization
We study Frank-Wolfe methods for nonconvex stochastic and finite-sum opt...
Fast Stochastic Methods for Nonsmooth Nonconvex Optimization
We analyze stochastic algorithms for optimizing nonconvex, nonsmooth fin...
Stochastic Variance Reduction for Nonconvex Optimization
We study nonconvex finite-sum problems and analyze stochastic variance r...
Fast Incremental Method for Nonconvex Optimization
We analyze a fast incremental aggregated gradient method for optimizing ...
Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing
Nonparametric two sample testing is a decision theoretic problem that in...
On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
We study optimization algorithms based on variance reduction for stochas...
On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives
Nonparametric two sample testing deals with the question of consistently...
A Maximum Likelihood Approach For Selecting Sets of Alternatives
We consider the problem of selecting a subset of alternatives given nois...
Sashank J Reddi
