Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification

05/26/2022
by Niladri S. Chatterji, et al.

While a broad range of techniques has been proposed to tackle distribution shift, the simple baseline of training on an undersampled dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask whether learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that, in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets) or the algorithm leverages additional structure in the distribution shift. In particular, in the case of label shift, we show that there is always an undersampling algorithm that is minimax optimal, while in the case of group-covariate shift, we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.
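To make the baseline concrete, here is a minimal sketch of undersampling in the label shift setting, where each label is treated as a group. It is an illustration only, not the authors' implementation: the function name `undersample`, its parameters, and the toy data are all assumptions introduced here. Each group is randomly reduced to the size of the smallest group, so excess majority data is discarded before training.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Return a label-balanced subset (hypothetical helper, not from the paper).

    Every label group is randomly reduced to the size of the smallest group,
    so excess majority samples are dropped.
    """
    rng = random.Random(seed)

    # Collect the indices of the samples belonging to each label.
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)

    # The smallest group determines how many samples are kept per group.
    n_min = min(len(idx) for idx in by_label.values())

    # Draw n_min indices without replacement from every group.
    keep = []
    for idx in by_label.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()  # preserve the original ordering of the retained samples

    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data (assumed): 8 majority samples with label 0, 2 minority with label 1.
samples = list(range(10))
labels = [0] * 8 + [1] * 2
xs, ys = undersample(samples, labels)
print(ys.count(0), ys.count(1))  # 2 2
```

The balanced subset has only as many samples per group as the minority group provides, which is the quantity the abstract identifies as the constraint on worst-case performance.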

