
Distilling Double Descent
Distillation is the technique of training a "student" model based on exa...

Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning
Federated learning is a challenging optimization problem due to the hete...

O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Transformer networks use pairwise attention to compute contextual embedd...

Why distillation helps: a statistical perspective
Knowledge distillation is a technique for improving the performance of a...

Doubly-stochastic mining for heterogeneous retrieval
Modern retrieval problems are characterised by training sets with potent...

Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...

Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the...

Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm ...

SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning
Federated learning is a key scenario in modern large-scale machine learn...

AdaCliP: Adaptive Clipping for Private SGD
Privacy preserving machine learning algorithms are crucial for learning ...

On the Convergence of Adam and Beyond
Several recently proposed stochastic optimization methods that have been...

Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learni...

Stochastic Negative Mining for Learning with Large Output Spaces
We consider the problem of retrieving the most relevant labels for a giv...

A Generic Approach for Escaping Saddle Points
A central challenge to using first-order methods for optimizing nonconve...

AIDE: Fast and Communication Efficient Distributed Optimization
In this paper, we present two new communication-efficient methods for di...

Stochastic Frank-Wolfe Methods for Nonconvex Optimization
We study Frank-Wolfe methods for nonconvex stochastic and finite-sum opt...

Fast Stochastic Methods for Nonsmooth Nonconvex Optimization
We analyze stochastic algorithms for optimizing nonconvex, nonsmooth fin...

Stochastic Variance Reduction for Nonconvex Optimization
We study nonconvex finite-sum problems and analyze stochastic variance r...

Fast Incremental Method for Nonconvex Optimization
We analyze a fast incremental aggregated gradient method for optimizing ...

Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing
Nonparametric two sample testing is a decision theoretic problem that in...

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
We study optimization algorithms based on variance reduction for stochas...

On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives
Nonparametric two sample testing deals with the question of consistently...

A Maximum Likelihood Approach For Selecting Sets of Alternatives
We consider the problem of selecting a subset of alternatives given nois...
Sashank J Reddi