The Value of Out-of-Distribution Data

08/23/2022
by Ashwin De Silva, et al.

More data helps us generalize to a task. But real datasets can contain out-of-distribution (OOD) data, which can arise from heterogeneity such as intra-class variability, or from temporal shifts or concept drift. We demonstrate a counter-intuitive phenomenon for such problems: the generalization error on the task can be a non-monotonic function of the number of OOD samples. A small number of OOD samples can improve generalization, but once the number of OOD samples exceeds a threshold, the generalization error can deteriorate. We also show that if we know which samples are OOD, then using a weighted objective between the target and OOD samples ensures that the generalization error decreases monotonically. We demonstrate and analyze this issue using linear classifiers on synthetic datasets and medium-sized neural networks on CIFAR-10.
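
To make the idea of a weighted objective between target and OOD samples concrete, here is a minimal sketch for a linear classifier. The abstract does not state the exact form of the objective, so the convex combination below, the mixing weight `alpha`, the choice of logistic loss, and the function names are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Mean logistic loss of a linear classifier; labels y are in {-1, +1}."""
    margins = y * (X @ w)
    # log(1 + exp(-margin)), computed in a numerically stable way
    return np.mean(np.logaddexp(0.0, -margins))


def weighted_objective(w, X_target, y_target, X_ood, y_ood, alpha):
    """Convex combination of the target loss and the OOD loss.

    alpha is a hypothetical mixing weight in [0, 1] (an assumption, not
    taken from the source): alpha = 1 uses only target samples and
    alpha = 0 uses only OOD samples.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    loss_target = logistic_loss(w, X_target, y_target)
    # With no OOD samples the objective reduces to the weighted target loss.
    loss_ood = logistic_loss(w, X_ood, y_ood) if len(y_ood) > 0 else 0.0
    return alpha * loss_target + (1.0 - alpha) * loss_ood
```

Because each group's loss is averaged separately before mixing, adding more OOD samples does not by itself increase their influence on the objective. That influence is controlled by `alpha`, which in practice would have to be chosen, for example on held-out target data.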

