
- Mime: Mimicking Centralized Stochastic Algorithms in Federated Learning
  Federated learning is a challenging optimization problem due to the hete...
- O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
  Transformer networks use pairwise attention to compute contextual embedd...
- Why distillation helps: a statistical perspective
  Knowledge distillation is a technique for improving the performance of a...
  (A sketch of the standard distillation objective appears after this list.)
- Doubly-stochastic mining for heterogeneous retrieval
  Modern retrieval problems are characterised by training sets with potent...
- Low-Rank Bottleneck in Multi-head Attention Models
  Attention based Transformer architecture has enabled significant advance...
- Are Transformers universal approximators of sequence-to-sequence functions?
  Despite the widespread adoption of Transformer models for NLP tasks, the...
- Why ADAM Beats SGD for Attention Models
  While stochastic gradient descent (SGD) is still the de facto algorithm ...
- SCAFFOLD: Stochastic Controlled Averaging for On-Device Federated Learning
  Federated learning is a key scenario in modern large-scale machine learn...
  (A sketch of a control-variate local update appears after this list.)
- AdaCliP: Adaptive Clipping for Private SGD
  Privacy preserving machine learning algorithms are crucial for learning ...
  (A sketch of the underlying private-SGD step appears after this list.)
- On the Convergence of Adam and Beyond
  Several recently proposed stochastic optimization methods that have been...
  (A sketch of the AMSGrad-style update appears after this list.)
- Escaping Saddle Points with Adaptive Gradient Methods
  Adaptive methods such as Adam and RMSProp are widely used in deep learni...
- Stochastic Negative Mining for Learning with Large Output Spaces
  We consider the problem of retrieving the most relevant labels for a giv...
  (A sketch of sampled hard-negative mining appears after this list.)
- A Generic Approach for Escaping Saddle Points
  A central challenge to using first-order methods for optimizing nonconve...
- AIDE: Fast and Communication Efficient Distributed Optimization
  In this paper, we present two new communication-efficient methods for di...
- Stochastic Frank-Wolfe Methods for Nonconvex Optimization
  We study Frank-Wolfe methods for nonconvex stochastic and finite-sum opt...
  (A sketch of a stochastic Frank-Wolfe step appears after this list.)
- Fast Stochastic Methods for Nonsmooth Nonconvex Optimization
  We analyze stochastic algorithms for optimizing nonconvex, nonsmooth fin...
- Stochastic Variance Reduction for Nonconvex Optimization
  We study nonconvex finite-sum problems and analyze stochastic variance r...
  (A sketch of the SVRG gradient estimator appears after this list.)
- Fast Incremental Method for Nonconvex Optimization
  We analyze a fast incremental aggregated gradient method for optimizing ...
- Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing
  Nonparametric two sample testing is a decision theoretic problem that in...
- On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
  We study optimization algorithms based on variance reduction for stochas...
- On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives
  Nonparametric two sample testing deals with the question of consistently...
- A Maximum Likelihood Approach For Selecting Sets of Alternatives
  We consider the problem of selecting a subset of alternatives given nois...
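
Sketches referenced in the list above. Each is a minimal, illustrative rendering of the general technique an entry builds on, not code from the paper; function and parameter names are assumptions made here for illustration.

For "Why distillation helps": the standard Hinton-style distillation objective that the paper studies from a statistical angle, assuming a typical temperature and mixing-weight parameterization:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Hard-label term: ordinary cross-entropy against the true labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened teacher
    # and student distributions; the T^2 factor keeps gradient scales
    # comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * kl + (1 - alpha) * ce
```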
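
For SCAFFOLD: a sketch of a control-variate-corrected local update in the spirit of the paper. The client control variate c_i and server control variate c correct each local step so that clients with heterogeneous data drift less; the refresh rule shown is one of the variants described in the paper, and all names are illustrative:

```python
import numpy as np

def scaffold_client_update(x, c, c_i, grad_fn, local_steps=10, lr=0.1):
    # x: current server model; c / c_i: server / client control variates;
    # grad_fn(y): stochastic gradient of this client's loss at y.
    y = x.copy()
    for _ in range(local_steps):
        g = grad_fn(y)
        y = y - lr * (g - c_i + c)  # drift-corrected local step
    # Refresh the client control variate from the net local movement.
    c_i_new = c_i - c + (x - y) / (local_steps * lr)
    return y, c_i_new
```

The server would then average the model deltas y - x across sampled clients and fold the averaged change in the c_i into c.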
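
For AdaCliP: the paper adapts the clipping transform inside differentially private SGD; the sketch below shows only the basic private-SGD primitive it builds on (per-example clipping plus calibrated Gaussian noise), not AdaCliP's adaptive per-coordinate rescaling:

```python
import numpy as np

def private_sgd_step(x, per_example_grads, lr=0.1, clip_norm=1.0,
                     noise_mult=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    # Clip each example's gradient to l2 norm at most clip_norm, so any
    # single example has bounded influence on the update.
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clip norm: std sigma*C on the sum,
    # hence sigma*C/batch on the mean.
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped),
                       size=avg.shape)
    return x - lr * (avg + noise)
```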
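
For "On the Convergence of Adam and Beyond": the AMSGrad-style fix the paper proposes keeps a running maximum of Adam's second-moment estimate, so the effective per-coordinate step size never increases. Hyperparameters are illustrative and bias correction is omitted for brevity:

```python
import numpy as np

def amsgrad(grad_fn, x0, steps=1000, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # first-moment (momentum) estimate
    v = np.zeros_like(x)       # second-moment estimate
    v_hat = np.zeros_like(x)   # running max of v: the AMSGrad change
    for _ in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        x -= lr * m / (np.sqrt(v_hat) + eps)
    return x
```

For example, amsgrad(lambda x: 2 * x, np.array([5.0]), lr=0.1) drives the quadratic x**2 to its minimizer at 0.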
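
For "Stochastic Negative Mining": with a huge output space, scoring every label at every step is infeasible, so the idea is to sample a random subset of labels and keep only the highest-scoring ("hardest") sampled negatives for the loss. score_fn and the sizes below are illustrative assumptions:

```python
import numpy as np

def mine_hard_negatives(score_fn, query, positive_labels, num_labels,
                        sample_size=1000, top_k=20, rng=None):
    rng = rng or np.random.default_rng(0)
    # Sample a random subset of the large label space...
    candidates = rng.choice(num_labels, size=sample_size, replace=False)
    negatives = np.array([c for c in candidates if c not in positive_labels])
    # ...and keep the highest-scoring sampled negatives for the loss.
    scores = score_fn(query, negatives)   # assumed model scoring function
    return negatives[np.argsort(scores)[-top_k:]]
```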
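
For the stochastic Frank-Wolfe entry: Frank-Wolfe methods are projection-free; each step solves a linear problem over the constraint set and moves by convex combination, so iterates stay feasible. The sketch uses an l1-ball constraint, whose linear minimizer is a signed coordinate vertex, and the classic 2/(t+2) schedule; the paper's nonconvex variants differ in step sizes and minibatching:

```python
import numpy as np

def stochastic_frank_wolfe(stoch_grad, x0, radius=1.0, steps=100):
    x = np.asarray(x0, dtype=float).copy()
    for t in range(steps):
        g = stoch_grad(x)                 # unbiased stochastic gradient
        # Linear minimization oracle over {s : ||s||_1 <= radius}.
        j = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[j] = -radius * np.sign(g[j])
        gamma = 2.0 / (t + 2)             # classic step-size schedule
        x = (1 - gamma) * x + gamma * s   # convex combination stays feasible
    return x
```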
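
For "Stochastic Variance Reduction for Nonconvex Optimization": the SVRG estimator corrects each stochastic gradient with the same component's gradient at a periodic snapshot plus the snapshot's full gradient; it stays unbiased while its variance shrinks as the iterate nears the snapshot. A sketch for f(x) = (1/n) * sum_i f_i(x), with illustrative names:

```python
import numpy as np

def svrg(grad_i, n, x0, epochs=10, inner_steps=100, lr=0.01, rng=None):
    # grad_i(x, i): gradient of the i-th component function at x.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        snapshot = x.copy()
        # Full gradient at the snapshot, reused by every inner step.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced, unbiased gradient estimate.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * g
    return x
```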