Sanjiv Kumar

research

∙ 07/06/2023

When Does Confidence-Based Cascade Deferral Suffice?

Cascades are a classical strategy to enable inference cost to vary adapt...

0 Wittawat Jitkrittum, et al. ∙

research

∙ 05/13/2023

Depth Dependence of μP Learning Rates in ReLU MLPs

In this short note we consider random fully connected ReLU networks of w...

7 Samy Jelassi, et al. ∙

research

∙ 02/03/2023

ResMem: Learn what you can and memorize the rest

The impressive generalization performance of modern neural networks is a...

2 Zitong Yang, et al. ∙

research

∙ 01/30/2023

On student-teacher deviations in distillation: does it pay to disobey?

Knowledge distillation has been widely-used to improve the performance o...

10 Vaishnavh Nagarajan, et al. ∙

research

∙ 01/29/2023

Learning to reject meets OOD detection: Are all abstentions created equal?

Learning to reject (L2R) and out-of-distribution (OOD) detection are two...

3 Harikrishna Narasimhan, et al. ∙

research

∙ 01/28/2023

Supervision Complexity and its Role in Knowledge Distillation

Despite the popularity and efficacy of knowledge distillation, there is ...

8 Hrayr Harutyunyan, et al. ∙

research

∙ 01/28/2023

Leveraging Importance Weights in Subset Selection

We present a subset selection algorithm designed to work with arbitrary ...

0 Gui Citovsky, et al. ∙

research

∙ 01/27/2023

EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval

Large neural models (such as Transformers) achieve state-of-the-art perf...

12 Seungyeon Kim, et al. ∙

research

∙ 01/04/2023

Automating Nearest Neighbor Search Configuration with Constrained Optimization

The approximate nearest neighbor (ANN) search problem is fundamental to ...

6 Philip Sun, et al. ∙

research

∙ 11/09/2022

Large Language Models with Controllable Working Memory

Large language models (LLMs) have led to a series of breakthroughs in na...

6 Daliang Li, et al. ∙

research

∙ 11/01/2022

Preserving In-Context Learning ability in Large Language Model Fine-tuning

Pretrained large language models (LLMs) are strong in-context learners t...

5 Yihan Wang, et al. ∙

research

∙ 10/28/2022

When does mixup promote local linearity in learned representations?

Mixup is a regularization technique that artificially produces new sampl...

0 Arslan Chaudhry, et al. ∙

research

∙ 10/12/2022

Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

This paper studies the curious phenomenon for machine learning models wi...

19 Zonglin Li, et al. ∙

research

∙ 10/11/2022

Decoupled Context Processing for Context Augmented Language Modeling

Language models can be augmented with a context retriever to incorporate...

9 Zonglin Li, et al. ∙

research

∙ 06/28/2022

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

This paper presents a novel nearest neighbor search algorithm achieving ...

15 Felix Chern, et al. ∙

research

∙ 04/27/2022

ELM: Embedding and Logit Margins for Long-Tail Learning

Long-tail learning is the problem of learning under skewed label distrib...

9 Wittawat Jitkrittum, et al. ∙

research

∙ 02/15/2022

Predicting on the Edge: Identifying Where a Larger Model Does Better

Much effort has been devoted to making large and more accurate models, b...

6 Taman Narayan, et al. ∙

research

∙ 02/02/2022

Robust Training of Neural Networks using Scale Invariant Architectures

In contrast to SGD, adaptive gradient methods like Adam allow robust tra...

6 Zhiyuan Li, et al. ∙

research

∙ 10/19/2021

When in Doubt, Summon the Titans: Efficient Inference with Large Models

Scaling neural networks to "large" sizes, with billions of parameters, h...

5 Ankit Singh Rawat, et al. ∙

research

∙ 07/29/2021

Batch Active Learning at Scale

The ability to train complex and highly effective models often requires ...

2 Gui Citovsky, et al. ∙

research

∙ 06/19/2021

Teacher's pet: understanding and mitigating biases in distillation

Knowledge distillation is widely used as a means of improving the perfor...

5 Michal Lukasik, et al. ∙

research

∙ 05/25/2021

Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets

Hierarchical Agglomerative Clustering (HAC) is one of the oldest but sti...

12 Baris Sumengen, et al. ∙

research

∙ 05/19/2021

Balancing Robustness and Sensitivity using Feature Contrastive Learning

It is generally believed that robust training of extremely large network...

11 Seungyeon Kim, et al. ∙

research

∙ 05/12/2021

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces

Negative sampling schemes enable efficient training given a large number...

2 Ankit Singh Rawat, et al. ∙

research

∙ 04/26/2021

Balancing Constraints and Submodularity in Data Subset Selection

Deep learning has yielded extraordinary results in vision and natural la...

7 Srikumar Ramalingam, et al. ∙

research

∙ 02/05/2021

On the Reproducibility of Neural Network Predictions

Standard training techniques for neural networks involve multiple source...

14 Srinadh Bhojanapalli, et al. ∙

research

∙ 12/01/2020

Modifying Memories in Transformer Models

Large Transformer models have achieved impressive performance in many na...

0 Chen Zhu, et al. ∙

research

∙ 10/23/2020

Coping with Label Shift via Distributionally Robust Optimisation

The label shift problem refers to the supervised learning setting where ...

6 Jingzhao Zhang, et al. ∙

research

∙ 07/27/2020

Learning discrete distributions: user vs item-level privacy

Much of the literature on differential privacy focuses on item-level pri...

5 Yuhan Liu, et al. ∙

research

∙ 06/08/2020

O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Transformer networks use pairwise attention to compute contextual embedd...

5 Chulhee Yun, et al. ∙

research

∙ 05/31/2020

Evaluations and Methods for Explanation through Robustness Analysis

Among multiple ways of interpreting a machine learning model, measuring ...

15 Cheng-Yu Hsieh, et al. ∙

research

∙ 05/21/2020

Why distillation helps: a statistical perspective

Knowledge distillation is a technique for improving the performance of a...

41 Aditya Krishna Menon, et al. ∙

research

∙ 04/23/2020

Doubly-stochastic mining for heterogeneous retrieval

Modern retrieval problems are characterised by training sets with potent...

6 Ankit Singh Rawat, et al. ∙

research

∙ 04/21/2020

Federated Learning with Only Positive Labels

We consider learning a multi-class classification model in the federated...

9 Felix X. Yu, et al. ∙

research

∙ 04/11/2020

Robust Large-Margin Learning in Hyperbolic Space

Recently, there has been a surge of interest in representation learning ...

12 Melanie Weber, et al. ∙

research

∙ 03/05/2020

Does label smoothing mitigate label noise?

Label smoothing is commonly used in training deep learning models, where...

11 Michal Lukasik, et al. ∙

research

∙ 02/29/2020

Adaptive Federated Optimization

Federated learning is a distributed machine learning paradigm in which a...

28 Sashank Reddi, et al. ∙

research

∙ 02/17/2020

Low-Rank Bottleneck in Multi-head Attention Models

Attention based Transformer architecture has enabled significant advance...

9 Srinadh Bhojanapalli, et al. ∙

research

∙ 02/10/2020

Pre-training Tasks for Embedding-based Large-scale Retrieval

We consider the large-scale query-document retrieval problem: given a qu...

4 Wei-Cheng Chang, et al. ∙

research

∙ 12/20/2019

Are Transformers universal approximators of sequence-to-sequence functions?

Despite the widespread adoption of Transformer models for NLP tasks, the...

24 Chulhee Yun, et al. ∙

research

∙ 12/06/2019

Why ADAM Beats SGD for Attention Models

While stochastic gradient descent (SGD) is still the de facto algorithm ...

0 Jingzhao Zhang, et al. ∙

research

∙ 10/21/2019

Learning to Learn by Zeroth-Order Oracle

In the learning to learn (L2L) framework, we cast the design of optimiza...

18 Yangjun Ruan, et al. ∙

research

∙ 09/20/2019

Online Hierarchical Clustering Approximations

Hierarchical clustering is a widely used approach for clustering dataset...

10 Aditya Krishna Menon, et al. ∙

research

∙ 08/27/2019

New Loss Functions for Fast Maximum Inner Product Search

Quantization based methods are popular for solving large scale maximum i...

11 Ruiqi Guo, et al. ∙

research

∙ 08/20/2019

AdaCliP: Adaptive Clipping for Private SGD

Privacy preserving machine learning algorithms are crucial for learning ...

1 Venkatadheeraj Pichapati, et al. ∙

research

∙ 07/24/2019

Sampled Softmax with Random Fourier Features

The computational cost of training with softmax cross entropy loss grows...

5 Ankit Singh Rawat, et al. ∙

research

∙ 06/05/2019

Neural SDE: Stabilizing Neural ODE Networks with Stochastic Noise

Neural Ordinary Differential Equation (Neural ODE) has been proposed as ...

9 Xuanqing Liu, et al. ∙

research

∙ 04/19/2019

On the Convergence of Adam and Beyond

Several recently proposed stochastic optimization methods that have been...

30 Sashank J Reddi, et al. ∙

research

∙ 03/25/2019

Local Orthogonal Decomposition for Maximum Inner Product Search

Inverted file and asymmetric distance computation (IVFADC) have been suc...

8 Xiang Wu, et al. ∙

research

∙ 03/20/2019

Efficient Inner Product Approximation in Hybrid Spaces

Many emerging use cases of data mining and machine learning operate on l...

10 Xiang Wu, et al. ∙
