
- Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning
  Federated learning is a challenging optimization problem due to the hete...
- O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
  Transformer networks use pairwise attention to compute contextual embedd...
- Why distillation helps: a statistical perspective
  Knowledge distillation is a technique for improving the performance of a...
  (A sketch of the standard distillation objective appears after this list.)
- Doubly-stochastic mining for heterogeneous retrieval
  Modern retrieval problems are characterised by training sets with potent...
- Low-Rank Bottleneck in Multi-head Attention Models
  Attention based Transformer architecture has enabled significant advance...
- Are Transformers universal approximators of sequence-to-sequence functions?
  Despite the widespread adoption of Transformer models for NLP tasks, the...
- Why ADAM Beats SGD for Attention Models
  While stochastic gradient descent (SGD) is still the de facto algorithm ...
- SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning
  Federated learning is a key scenario in modern large-scale machine learn...
  (A sketch of a control-variate local update appears after this list.)
- AdaCliP: Adaptive Clipping for Private SGD
  Privacy preserving machine learning algorithms are crucial for learning ...
  (A sketch of the underlying private-SGD step appears after this list.)
- On the Convergence of Adam and Beyond
  Several recently proposed stochastic optimization methods that have been...
  (A sketch of the AMSGrad-style update appears after this list.)
- Escaping Saddle Points with Adaptive Gradient Methods
  Adaptive methods such as Adam and RMSProp are widely used in deep learni...
- Stochastic Negative Mining for Learning with Large Output Spaces
  We consider the problem of retrieving the most relevant labels for a giv...
  (A sketch of sampled hard-negative mining appears after this list.)
- A Generic Approach for Escaping Saddle Points
  A central challenge to using first-order methods for optimizing nonconve...
- AIDE: Fast and Communication Efficient Distributed Optimization
  In this paper, we present two new communication-efficient methods for di...
- Stochastic Frank-Wolfe Methods for Nonconvex Optimization
  We study Frank-Wolfe methods for nonconvex stochastic and finite-sum opt...
  (A sketch of a stochastic Frank-Wolfe step appears after this list.)
- Fast Stochastic Methods for Nonsmooth Nonconvex Optimization
  We analyze stochastic algorithms for optimizing nonconvex, nonsmooth fin...
- Stochastic Variance Reduction for Nonconvex Optimization
  We study nonconvex finite-sum problems and analyze stochastic variance r...
  (A sketch of the SVRG gradient estimator appears after this list.)
- Fast Incremental Method for Nonconvex Optimization
  We analyze a fast incremental aggregated gradient method for optimizing ...
- Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing
  Nonparametric two sample testing is a decision theoretic problem that in...
- On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
  We study optimization algorithms based on variance reduction for stochas...
- On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives
  Nonparametric two sample testing deals with the question of consistently...
- A Maximum Likelihood Approach For Selecting Sets of Alternatives
  We consider the problem of selecting a subset of alternatives given nois...
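
Sketches referenced in the list above. Each is a minimal, illustrative rendering of the general technique an entry builds on, not code from the paper; function and parameter names are assumptions made here for illustration.

For "Why distillation helps": the standard Hinton-style distillation objective that the paper studies from a statistical angle, assuming a typical temperature and mixing-weight parameterization:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Hard-label term: ordinary cross-entropy against the true labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened teacher
    # and student distributions; the T^2 factor keeps gradient scales
    # comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * kl + (1 - alpha) * ce
```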
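
For SCAFFOLD: a sketch of a control-variate-corrected local update in the spirit of the paper. The client control variate c_i and server control variate c correct each local step so that clients with heterogeneous data drift less; the refresh rule shown is one of the variants described in the paper, and all names are illustrative:

```python
import numpy as np

def scaffold_client_update(x, c, c_i, grad_fn, local_steps=10, lr=0.1):
    # x: current server model; c / c_i: server / client control variates;
    # grad_fn(y): stochastic gradient of this client's loss at y.
    y = x.copy()
    for _ in range(local_steps):
        g = grad_fn(y)
        y = y - lr * (g - c_i + c)  # drift-corrected local step
    # Refresh the client control variate from the net local movement.
    c_i_new = c_i - c + (x - y) / (local_steps * lr)
    return y, c_i_new
```

The server would then average the model deltas y - x across sampled clients and fold the averaged change in the c_i into c.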
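
For AdaCliP: the paper adapts the clipping transform inside differentially private SGD; the sketch below shows only the basic private-SGD primitive it builds on (per-example clipping plus calibrated Gaussian noise), not AdaCliP's adaptive per-coordinate rescaling:

```python
import numpy as np

def private_sgd_step(x, per_example_grads, lr=0.1, clip_norm=1.0,
                     noise_mult=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    # Clip each example's gradient to l2 norm at most clip_norm, so any
    # single example has bounded influence on the update.
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clip norm: std sigma*C on the sum,
    # hence sigma*C/batch on the mean.
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped),
                       size=avg.shape)
    return x - lr * (avg + noise)
```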
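
For "On the Convergence of Adam and Beyond": the AMSGrad-style fix the paper proposes keeps a running maximum of Adam's second-moment estimate, so the effective per-coordinate step size never increases. Hyperparameters are illustrative and bias correction is omitted for brevity:

```python
import numpy as np

def amsgrad(grad_fn, x0, steps=1000, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # first-moment (momentum) estimate
    v = np.zeros_like(x)       # second-moment estimate
    v_hat = np.zeros_like(x)   # running max of v: the AMSGrad change
    for _ in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        x -= lr * m / (np.sqrt(v_hat) + eps)
    return x
```

For example, amsgrad(lambda x: 2 * x, np.array([5.0]), lr=0.1) drives the quadratic x**2 to its minimizer at 0.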
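
For "Stochastic Negative Mining": with a huge output space, scoring every label at every step is infeasible, so the idea is to sample a random subset of labels and keep only the highest-scoring ("hardest") sampled negatives for the loss. score_fn and the sizes below are illustrative assumptions:

```python
import numpy as np

def mine_hard_negatives(score_fn, query, positive_labels, num_labels,
                        sample_size=1000, top_k=20, rng=None):
    rng = rng or np.random.default_rng(0)
    # Sample a random subset of the large label space...
    candidates = rng.choice(num_labels, size=sample_size, replace=False)
    negatives = np.array([c for c in candidates if c not in positive_labels])
    # ...and keep the highest-scoring sampled negatives for the loss.
    scores = score_fn(query, negatives)   # assumed model scoring function
    return negatives[np.argsort(scores)[-top_k:]]
```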
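
For the stochastic Frank-Wolfe entry: Frank-Wolfe methods are projection-free; each step solves a linear problem over the constraint set and moves by convex combination, so iterates stay feasible. The sketch uses an l1-ball constraint, whose linear minimizer is a signed coordinate vertex, and the classic 2/(t+2) schedule; the paper's nonconvex variants differ in step sizes and minibatching:

```python
import numpy as np

def stochastic_frank_wolfe(stoch_grad, x0, radius=1.0, steps=100):
    x = np.asarray(x0, dtype=float).copy()
    for t in range(steps):
        g = stoch_grad(x)                 # unbiased stochastic gradient
        # Linear minimization oracle over {s : ||s||_1 <= radius}.
        j = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[j] = -radius * np.sign(g[j])
        gamma = 2.0 / (t + 2)             # classic step-size schedule
        x = (1 - gamma) * x + gamma * s   # convex combination stays feasible
    return x
```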
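
For "Stochastic Variance Reduction for Nonconvex Optimization": the SVRG estimator corrects each stochastic gradient with the same component's gradient at a periodic snapshot plus the snapshot's full gradient; it stays unbiased while its variance shrinks as the iterate nears the snapshot. A sketch for f(x) = (1/n) * sum_i f_i(x), with illustrative names:

```python
import numpy as np

def svrg(grad_i, n, x0, epochs=10, inner_steps=100, lr=0.01, rng=None):
    # grad_i(x, i): gradient of the i-th component function at x.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        snapshot = x.copy()
        # Full gradient at the snapshot, reused by every inner step.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced, unbiased gradient estimate.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * g
    return x
```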