
On the Reproducibility of Neural Network Predictions
Standard training techniques for neural networks involve multiple source...
Modifying Memories in Transformer Models
Large Transformer models have achieved impressive performance in many na...
Coping with Label Shift via Distributionally Robust Optimisation
The label shift problem refers to the supervised learning setting where ...
Learning discrete distributions: user vs item-level privacy
Much of the literature on differential privacy focuses on item-level pri...
O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Transformer networks use pairwise attention to compute contextual embedd...
Evaluations and Methods for Explanation through Robustness Analysis
Among multiple ways of interpreting a machine learning model, measuring ...
Why distillation helps: a statistical perspective
Knowledge distillation is a technique for improving the performance of a...
Doubly-stochastic mining for heterogeneous retrieval
Modern retrieval problems are characterised by training sets with potent...
Federated Learning with Only Positive Labels
We consider learning a multiclass classification model in the federated...
Robust Large-Margin Learning in Hyperbolic Space
Recently, there has been a surge of interest in representation learning ...
Does label smoothing mitigate label noise?
Label smoothing is commonly used in training deep learning models, where...
Adaptive Federated Optimization
Federated learning is a distributed machine learning paradigm in which a...
Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...
Pre-training Tasks for Embedding-based Large-scale Retrieval
We consider the large-scale query-document retrieval problem: given a qu...
Are Transformers universal approximators of sequence-to-sequence functions?
Despite the widespread adoption of Transformer models for NLP tasks, the...
Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm ...
Learning to Learn by Zeroth-Order Oracle
In the learning to learn (L2L) framework, we cast the design of optimiza...
Online Hierarchical Clustering Approximations
Hierarchical clustering is a widely used approach for clustering dataset...
New Loss Functions for Fast Maximum Inner Product Search
Quantization based methods are popular for solving large scale maximum i...
AdaCliP: Adaptive Clipping for Private SGD
Privacy preserving machine learning algorithms are crucial for learning ...
Sampled Softmax with Random Fourier Features
The computational cost of training with softmax cross entropy loss grows...
Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise
Neural Ordinary Differential Equation (Neural ODE) has been proposed as ...
On the Convergence of Adam and Beyond
Several recently proposed stochastic optimization methods that have been...
Local Orthogonal Decomposition for Maximum Inner Product Search
Inverted file and asymmetric distance computation (IVFADC) have been suc...
Efficient Inner Product Approximation in Hybrid Spaces
Many emerging use cases of data mining and machine learning operate on l...
Escaping Saddle Points with Adaptive Gradient Methods
Adaptive methods such as Adam and RMSProp are widely used in deep learni...
Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks
Neural language models have been widely used in various NLP tasks, inclu...
Stochastic Negative Mining for Learning with Large Output Spaces
We consider the problem of retrieving the most relevant labels for a giv...
Truncated Laplacian Mechanism for Approximate Differential Privacy
We derive a class of noise probability distributions to preserve (ϵ, δ)...
Optimal Noise-Adding Mechanism in Additive Differential Privacy
We derive the optimal (0, δ)-differentially private query-output indepen...
The Sparse Recovery Autoencoder
Linear encoding of sparse vectors is widely popular, but is most commonl...
cpSGD: Communication-efficient and differentially-private distributed SGD
Distributed stochastic gradient descent is an important subroutine in di...
Nonlinear Online Learning with Adaptive Nyström Approximation
Use of nonlinear feature maps via kernel approximation has led to succes...
Now Playing: Continuous low-power music recognition
Existing music recognition applications require a connection to a server...
Efficient Natural Language Response Suggestion for Smart Reply
This paper presents a computationally efficient machine-learned method f...
Stochastic Generative Hashing
Learning-based binary hashing has become a powerful paradigm for fast se...
Orthogonal Random Features
We present an intriguing discovery related to Random Fourier Features: i...
Binary embeddings with structured hashed projections
We consider the hashing mechanism for constructing binary embeddings, th...
Structured Transforms for Small-Footprint Deep Learning
We consider the task of building compact deep learning pipelines suitabl...
Learning to Hash for Indexing Big Data - A Survey
The explosive growth in big data has attracted much attention in designi...
Quantization based Fast Inner Product Search
We propose a quantization based approach for fast approximate Maximum In...
Fast Online Clustering with Randomized Skeleton Sets
We present a new fast online clustering algorithm that reliably recovers...
Compact Nonlinear Maps and Circulant Extensions
Kernel approximation via nonlinear random feature maps is widely used in...
An exploration of parameter redundancy in deep networks with circulant projections
We explore the redundancy of parameters in deep neural networks by repla...
Circulant Binary Embedding
Binary embedding of high-dimensional data requires long codes to preserv...
On Learning from Label Proportions
Learning from Label Proportions (LLP) is a learning setting, where the t...
∝SVM for learning with label proportions
We study the problem of learning with label proportions in which the tra...
On the Difficulty of Nearest Neighbor Search
Fast approximate nearest neighbor (NN) search in large databases is beco...
Compact Hyperplane Hashing with Bilinear Functions
Hyperplane hashing aims at rapidly searching nearest points to a hyperpl...
Sanjiv Kumar
Research Scientist at Google Research, NY, Principal Scientist at Google Research, NY, PhD (2005; Robotics, SCS, CMU)