
On the Reproducibility of Neural Network Predictions
Standard training techniques for neural networks involve multiple source...

Modifying Memories in Transformer Models
Large Transformer models have achieved impressive performance in many na...

Coping with Label Shift via Distributionally Robust Optimisation
The label shift problem refers to the supervised learning setting where ...

Learning discrete distributions: user vs item-level privacy
Much of the literature on differential privacy focuses on item-level pri...

O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Transformer networks use pairwise attention to compute contextual embedd...

Evaluations and Methods for Explanation through Robustness Analysis
Among multiple ways of interpreting a machine learning model, measuring ...

Why distillation helps: a statistical perspective
Knowledge distillation is a technique for improving the performance of a...

Doubly-stochastic mining for heterogeneous retrieval
Modern retrieval problems are characterised by training sets with potent...

Federated Learning with Only Positive Labels
We consider learning a multi-class classification model in the federated...

Robust Large-Margin Learning in Hyperbolic Space
Recently, there has been a surge of interest in representation learning ...

Does label smoothing mitigate label noise?
Label smoothing is commonly used in training deep learning models, where...

Adaptive Federated Optimization
Federated learning is a distributed machine learning paradigm in which a...

Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...

Pre-training Tasks for Embedding-based Large-scale Retrieval
We consider the large-scale query-document retrieval problem: given a qu...

Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the...

Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm ...

Learning to Learn by Zeroth-Order Oracle
In the learning to learn (L2L) framework, we cast the design of optimiza...

Online Hierarchical Clustering Approximations
Hierarchical clustering is a widely used approach for clustering dataset...

New Loss Functions for Fast Maximum Inner Product Search
Quantization based methods are popular for solving large scale maximum i...

AdaCliP: Adaptive Clipping for Private SGD
Privacy preserving machine learning algorithms are crucial for learning ...

Sampled Softmax with Random Fourier Features
The computational cost of training with softmax cross entropy loss grows...

Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise
Neural Ordinary Differential Equation (Neural ODE) has been proposed as ...

On the Convergence of Adam and Beyond
Several recently proposed stochastic optimization methods that have been...

Local Orthogonal Decomposition for Maximum Inner Product Search
Inverted file and asymmetric distance computation (IVFADC) have been suc...

Efficient Inner Product Approximation in Hybrid Spaces
Many emerging use cases of data mining and machine learning operate on l...

Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learni...

Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks
Neural language models have been widely used in various NLP tasks, inclu...

Stochastic Negative Mining for Learning with Large Output Spaces
We consider the problem of retrieving the most relevant labels for a giv...

Truncated Laplacian Mechanism for Approximate Differential Privacy
We derive a class of noise probability distributions to preserve (ϵ, δ)...

Optimal Noise-Adding Mechanism in Additive Differential Privacy
We derive the optimal (0, δ)-differentially private query-output indepen...

The Sparse Recovery Autoencoder
Linear encoding of sparse vectors is widely popular, but is most commonl...

cpSGD: Communication-efficient and differentially-private distributed SGD
Distributed stochastic gradient descent is an important subroutine in di...

Nonlinear Online Learning with Adaptive Nyström Approximation
Use of nonlinear feature maps via kernel approximation has led to succes...

Now Playing: Continuous low-power music recognition
Existing music recognition applications require a connection to a server...

Efficient Natural Language Response Suggestion for Smart Reply
This paper presents a computationally efficient machine-learned method f...

Stochastic Generative Hashing
Learning-based binary hashing has become a powerful paradigm for fast se...

Orthogonal Random Features
We present an intriguing discovery related to Random Fourier Features: i...

Binary embeddings with structured hashed projections
We consider the hashing mechanism for constructing binary embeddings, th...

Structured Transforms for Small-Footprint Deep Learning
We consider the task of building compact deep learning pipelines suitabl...

Learning to Hash for Indexing Big Data – A Survey
The explosive growth in big data has attracted much attention in designi...

Quantization based Fast Inner Product Search
We propose a quantization based approach for fast approximate Maximum In...

Fast Online Clustering with Randomized Skeleton Sets
We present a new fast online clustering algorithm that reliably recovers...

Compact Nonlinear Maps and Circulant Extensions
Kernel approximation via nonlinear random feature maps is widely used in...

An exploration of parameter redundancy in deep networks with circulant projections
We explore the redundancy of parameters in deep neural networks by repla...

Circulant Binary Embedding
Binary embedding of high-dimensional data requires long codes to preserv...

On Learning from Label Proportions
Learning from Label Proportions (LLP) is a learning setting, where the t...

∝SVM for learning with label proportions
We study the problem of learning with label proportions in which the tra...

On the Difficulty of Nearest Neighbor Search
Fast approximate nearest neighbor (NN) search in large databases is beco...

Compact Hyperplane Hashing with Bilinear Functions
Hyperplane hashing aims at rapidly searching nearest points to a hyperpl...
Sanjiv Kumar
Research Scientist at Google Research, NY; Principal Scientist at Google Research, NY; PhD (2005; Robotics, SCS, CMU)