Introduction
Bigger data can usually train a better model; this is close to common sense for machine learning and deep learning practitioners nowadays. Classical
Empirical Risk Minimization (ERM) theory assumes that training and test samples are drawn i.i.d. from the same distribution. Learning then minimizes an estimate of the generalization risk, namely the empirical risk. Therefore, when the training set is large enough, optimizing the empirical risk yields a hypothesis function that works well on the test set.
However, ERM faces several challenges: 1) a model is learned from a training set generated by a distribution $P$, but the model is tested under a distribution $Q$; the distribution shift from $P$ to $Q$ violates ERM's basic assumption; 2) unknown noise in data and labels is common in reality, making some examples harmful to the model's performance [22, 26]; 3) training on large data sets imposes a significant computational burden; some large-scale deep learning models require hundreds or even thousands of GPUs.
Specifically, subsampling approaches were initially proposed to cope with the last challenge. By virtue of a subtle sampling regime, the selected subset can best approximate the original full set in terms of data distribution, so the model can be trained on a compressed version of the data set. In our work, we attempt to design sampling regimes that not only reduce the computational complexity, but also deal with several of ERM's other difficulties, for instance, reweighting examples by sampling probabilities to fix the mismatch between $P$ and $Q$, and dropping noisy samples to strengthen the model's generalization ability.
Our work can be outlined in four points. First, instead of merely approaching the full-set model $\hat{\theta}$, we prove that a model trained on a subset selected by our subsampling method can outperform $\hat{\theta}$. Second, we propose several probabilistic sampling functions and analyze how the sampling function influences the worst-case risk [3] over a divergence ball; we further propose a surrogate metric to measure the confidence degree of the sampling methods over the observed distribution, which is useful for evaluating the model's generalization ability on a set of distributions. Third, for the sake of implementation efficiency, a Hessian-free mixed Preconditioned Conjugate Gradient (PCG) method is used to compute the influence function (IF) in sparse scenarios. Last, comprehensive experiments are conducted on diverse tasks to demonstrate our method's superiority over existing state-of-the-art subsampling methods. (The code can be found at https://github.com/RyanWangZf/Influence_Subsampling.)
Related work
There are two main ideas for coping with the aforementioned ERM challenges: 1) pessimistic methods that try to learn models robust to noise or bad examples, including norm regularization, AdaBoost [7], hard example mining [15], and focal loss [14]; and 2) optimistic methods that modify the input distribution directly. There are several genres of optimistic methods: example reweighting is used for dealing with distribution shift by [3, 9] and for handling data bias by [13, 18]; sample selection is applied to inspect and fix mislabeled data by [27]. However, few of them address the computational burden of big data.
In order to reduce computation, weighted subsampling methods have been explored to approximate the maximum likelihood with a subset, for logistic regression
[6, 24] and for generalized linear models [2]. [23] introduces the IF into weighted subsampling to obtain asymptotically optimal sampling probabilities for several generalized linear models. However, how to treat the high variance of the weight terms in weighted subsampling remains an open problem.
Specifically, the IF is defined via Gateaux derivatives within the scope of robust statistics [11], and has been extended to measure example-wise influence [12] and feature-wise influence [21] on validation loss. The family of IFs was previously applied mainly to design adversarial examples and to explain the behaviour of black-box models. Recently, the IF on validation loss has been used for targeting important samples: [25] builds a sample selection scheme for deep convolutional networks (CNNs), and [20]
builds a specific influential-sample selection algorithm for Gradient Boosted Decision Trees (GBDT). However, so far there is no systematic theory to guide the IF's use in subsampling. Our work tries to build theoretical guidance for IF-based subsampling, which combines reweighting and subsampling to synthetically cope with ERM's challenges, e.g. distribution shift and noisy data.
Preliminaries
Training samples $z_i = (x_i, y_i)$, $i = 1, \dots, n$, are generated from a distribution $P$, where $x_i \in \mathbb{R}^d$ and $d$ is the number of feature dimensions. Specifically for the classification task, we have a hypothesis function $h_\theta$ parameterized by $\theta$. The goal is to minimize the 0-1 risk and learn the optimal $\theta$. For computational tractability, researchers focus on minimizing a surrogate loss, e.g. the log loss for binary classification:

$\ell(z, \theta) = -y \log h_\theta(x) - (1-y)\log\big(1 - h_\theta(x)\big)$ (1)

Therefore, the risk minimization problem can be empirically approximated by the empirical risk $\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(z_i, \theta)$, whose minimizer we denote by $\hat\theta$. The main notations are listed in Table 1.
$z_i$, $z'_j$ : training and testing sample, $i = 1,\dots,n$, $j = 1,\dots,m$.
$\hat\theta$, $\tilde\theta$ : full-set model and subset model.
$\mathcal{I}_i$ : $z_i$'s influence on the whole test-set risk.
$\mathcal{I}_\theta(z_i)$ : $z_i$'s influence on the model parameters.
$\epsilon_i$ : perturbation put on $z_i$'s loss term.
$\pi_i$ : sampling probability of $z_i$.
$\ell(z_i, \theta)$ : model $\theta$'s loss on training sample $z_i$.
$\ell(z'_j, \theta)$ : model $\theta$'s loss on test sample $z'_j$.
$P$ : training distribution.
$Q$ : a specific test distribution.
Weighted subsampling.
For a general subsampling framework, each sample $z_i$ is assigned a random variable
$o_i \in \{0, 1\}$, indicating whether this sample is selected or not, such that $\Pr[o_i = 1] = \pi_i$. The weighted subsampling methods share a similar form of objective function on the subset:

$\hat{R}_w(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{o_i}{\pi_i}\,\ell(z_i, \theta)$ (2)

where each term is weighted by the inverse of its sampling probability. This is similar to the technique used in causal inference to handle selection bias [19]. Eq. (3) derives the expectation of $\hat{R}_w(\theta)$ over the selection variables $o$:

$\mathbb{E}_{o}\big[\hat{R}_w(\theta)\big] = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{E}[o_i]}{\pi_i}\,\ell(z_i, \theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(z_i, \theta) = \hat{R}(\theta)$ (3)

The expectation of the weighted objective on a subset is the same as the empirical risk on the full set, which means weighted subsampling aims at finding an optimal $\pi$ that brings the subset risk minimizer as close to the full-set risk minimizer as possible.
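As a sanity check, the unbiasedness in Eq. (3) can be verified numerically. The per-sample losses and sampling probabilities below are arbitrary made-up values for illustration, not from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
losses = rng.random(n) + 0.5            # hypothetical per-sample losses l(z_i, theta)
pi = rng.uniform(0.2, 0.9, size=n)      # hypothetical sampling probabilities pi_i

full_risk = losses.mean()               # full-set empirical risk

# Monte Carlo over the selection variables o_i ~ Bernoulli(pi_i)
trials = 50_000
o = rng.random((trials, n)) < pi
weighted_risk = (o * (losses / pi)).sum(axis=1) / n   # weighted objective, Eq. (2)

# Eq. (3): the expectation of the weighted subset risk equals the full-set risk
assert abs(weighted_risk.mean() - full_risk) < 1e-2
# but the inverse-probability weights inflate the variance across draws
assert weighted_risk.std() > 0.01
```

The second assertion previews the variance problem discussed next: the $1/\pi_i$ weights keep the estimator unbiased, at the cost of spread across draws.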
This weighted approach has three main challenges: 1) we must modify existing training procedures to accommodate the weighted loss function; 2) because $\pi_i$ can be small and its inverse ranges widely, the weighted loss function suffers from high variance; and 3) most importantly, since the weighted methods build a consistent estimator of the full-set model [2], the subset model theoretically cannot outperform the full-set model.

Unweighted subsampling.
We propose a novel unweighted subsampling method which does not require the weights $1/\pi_i$ in its objective function:

$\hat{R}_u(\theta) = \frac{1}{|S|}\sum_{i \in S} \ell(z_i, \theta), \qquad S = \{i : o_i = 1\}$ (4)

where $|S|$ denotes the cardinality of the subset. From Eq. (4), the expectation of the subset model's risk is no longer equal to the full-set model's. This formula can be seen as implicitly reweighting the samples with respect to their sampling probabilities. It directly overcomes the first two challenges of weighted subsampling mentioned above, but further effort is required to solve the last one, which will be introduced in the following section. An intuitive demonstration of the difference between the weighted and unweighted methods is shown in Fig. 1.
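The implicit reweighting can also be illustrated numerically. In this sketch the losses are made up, and the sampling probabilities are deliberately anti-correlated with loss (as an influence-based scheme would produce); the expected unweighted subset risk then tracks the $\pi$-weighted average of the losses rather than the full-set risk:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
losses = rng.random(n) + 0.5                      # hypothetical per-sample losses
pi = 1.0 / (1.0 + np.exp(3.0 * (losses - 1.0)))   # high-loss samples get low probability

full_risk = losses.mean()
# implicit reweighting: each sample enters in proportion to its sampling probability
implicit = (pi * losses).sum() / pi.sum()

trials = 20_000
subset_risks = []
for _ in range(trials):
    sel = rng.random(n) < pi                      # o_i ~ Bernoulli(pi_i)
    if sel.any():
        subset_risks.append(losses[sel].mean())   # unweighted objective, Eq. (4)

assert abs(np.mean(subset_risks) - implicit) < 0.03
```

No training-procedure change and no $1/\pi_i$ terms are needed; the subset loss is just a plain average, which is the practical appeal of the unweighted form.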
Influence functions.
If the $i$-th training sample is upweighted by a small $\epsilon_i$ on its loss term, then the perturbed risk minimizer is

$\hat\theta_{\epsilon_i} = \arg\min_{\theta}\ \frac{1}{n}\sum_{k=1}^{n} \ell(z_k, \theta) + \epsilon_i\,\ell(z_i, \theta)$ (5)

The basic idea of the IF is to approximate the change of the parameters [4]:

$\mathcal{I}_\theta(z_i) = \frac{d\hat\theta_{\epsilon_i}}{d\epsilon_i}\Big|_{\epsilon_i = 0} = -H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z_i, \hat\theta)$ (6)

or the change of the test risk on a given test sample $z'_j$ from $Q$ [12]:

$\mathcal{I}(z_i, z'_j) = \frac{d\,\ell(z'_j, \hat\theta_{\epsilon_i})}{d\epsilon_i}\Big|_{\epsilon_i = 0} = -\nabla_\theta \ell(z'_j, \hat\theta)^{\top} H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z_i, \hat\theta)$ (7)

where $z_i$ and $z'_j$ come from the training and test sets respectively, and $H_{\hat\theta} = \frac{1}{n}\sum_{k=1}^{n} \nabla_\theta^{2} \ell(z_k, \hat\theta)$ is the Hessian matrix based on the full-set risk minimizer $\hat\theta$, which is positive definite (PD) if the empirical risk is twice-differentiable and strictly convex in $\theta$.
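For a concrete picture, the following sketch computes Eq. (7) for an L2-regularized logistic regression fit by Newton's method. All function names and the toy data are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logreg(X, y, w=None, lam=1e-2, n_iter=50):
    """L2-regularized logistic regression via Newton's method.
    w are optional per-sample weights (used below to check the IF)."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (w * (p - y)) / n + lam * theta
        H = X.T @ (X * (w * p * (1 - p))[:, None]) / n + lam * np.eye(d)
        theta = theta - np.linalg.solve(H, grad)
    return theta

def influence_on_test_loss(X, y, x_te, y_te, theta, lam=1e-2):
    """Eq. (7): I(z_i, z') = -grad l(z')^T H^{-1} grad l(z_i), one value per z_i."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    H = X.T @ (X * (p * (1 - p))[:, None]) / n + lam * np.eye(d)
    g_te = (sigmoid(x_te @ theta) - y_te) * x_te      # test-loss gradient
    s = np.linalg.solve(H, g_te)                      # H^{-1} g_te
    g_tr = (p - y)[:, None] * X                       # per-sample training gradients
    return -g_tr @ s
```

A positive influence means upweighting that sample is predicted to increase the test loss, so such samples are candidates for dropping; the approximation can be checked against an actual refit with sample $i$'s weight perturbed by $n\epsilon$, whose test-loss change is roughly $\epsilon\,\mathcal{I}(z_i, z')$.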
Methodology
The key challenge lies in that $\hat\theta$ may not be the best risk minimizer for the test distribution $Q$, due to the distribution shift between $P$ and $Q$ and unknown noisy samples in the training set. With the advent of the IF, we can measure one sample's influence on the test distribution without prohibitive leave-one-out retraining. The essential philosophy here is: given a test distribution $Q$, some samples in the training set cause increased test risk. If they are downweighted, we accordingly obtain a lower test risk than before, namely $R_{te}(\tilde\theta) \le R_{te}(\hat\theta)$, where $\tilde\theta$ is the new model learned after some harmful samples are downweighted.
Subset can be better
Consider perturbations $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ put on the training samples; the perturbed risk minimizer is denoted $\hat\theta_\epsilon$. Given $m$ samples from another distribution $Q$, the objective is to design the $\epsilon$ that minimizes the test risk $R_{te}(\hat\theta_\epsilon) = \frac{1}{m}\sum_{j=1}^{m} \ell(z'_j, \hat\theta_\epsilon)$.
According to the definition of the IF in Eq. (7), we can approximate the loss change of $z'_j$ if $z_i$ is upweighted by a small $\epsilon_i$:

$\ell(z'_j, \hat\theta_{\epsilon_i}) - \ell(z'_j, \hat\theta) \approx \epsilon_i\,\mathcal{I}(z_i, z'_j)$ (8)

which can be extended to the whole test distribution as follows:

$\mathcal{I}_i = \frac{1}{m}\sum_{j=1}^{m} \mathcal{I}(z_i, z'_j)$ (9)

For convenience, we use $\mathcal{I}_i$ to indicate one training sample $z_i$'s influence over the whole test set. Therefore, under the perturbation $\epsilon$, the test risk change can be approximated as follows (we assume that all elements in $\epsilon$ are small and that each training sample influences the test risk independently, hence we add up all terms linearly for simplicity of implementation):

$R_{te}(\hat\theta_\epsilon) - R_{te}(\hat\theta) \approx \sum_{i=1}^{n} \epsilon_i\,\mathcal{I}_i$ (10)

Specifically, suppose $\epsilon_i = 0$ for $i = 1, \dots, n$; then the subset is the same as the full set, and the approximated test risk change in Eq. (10) is $0$. Based on this analysis, we have Lemma 1. For clear notation, we use bold letters to represent random variables, such that $\epsilon_i$ and $\mathcal{I}_i$ are realizations of the random variables $\boldsymbol{\epsilon}$ and $\boldsymbol{\mathcal{I}}$, respectively.
Lemma 1.
The expectation of the influence function over the training distribution is always 0, which means:

$\mathbb{E}_{\mathbf{z}\sim P}\big[\boldsymbol{\mathcal{I}}\big] = 0$ (11)
According to Eq. (10), minimizing the test risk is equivalent to minimizing the objective function $\sum_{i} \epsilon_i\,\mathcal{I}_i$. Actually, this objective function is the empirical form of $\mathbb{E}[\boldsymbol{\epsilon}\,\boldsymbol{\mathcal{I}}]$, from which we derive Lemma 2.
Lemma 2.
The subset model performs no worse than the full-set model in terms of test risk if $\boldsymbol{\epsilon}$ and $\boldsymbol{\mathcal{I}}$ are negatively correlated:

$\mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) \le 0 \ \Longrightarrow\ R_{te}(\hat\theta_\epsilon) \le R_{te}(\hat\theta)$ (12)
Deterministic vs. Probabilistic sampling
Similar to Eq. (3), the expectation of the unweighted objective on the subset via the observation variables $o_i$ can be acquired (up to the normalization $1/|S|$) by:

$\mathbb{E}_{o}\Big[\sum_{i=1}^{n} o_i\,\ell(z_i, \theta)\Big] = \sum_{i=1}^{n} \pi_i\,\ell(z_i, \theta)$ (13)

However, the objective function in Eq. (10) is defined on the perturbation $\epsilon$ instead of the sampling probability $\pi$, so we need to bridge the gap between them:

$\sum_{i=1}^{n} \epsilon_i\,\mathcal{I}_i = \sum_{i=1}^{n} (1 + \epsilon_i)\,\mathcal{I}_i - \sum_{i=1}^{n} \mathcal{I}_i = \sum_{i=1}^{n} (1 + \epsilon_i)\,\mathcal{I}_i$ (14)

Eq. (14) holds because from Lemma 1 we know $\sum_{i} \mathcal{I}_i \approx 0$. Here we assume $\epsilon_i \in [-1, 0]$ ($\epsilon_i = 0$ means no perturbation is applied, while $\epsilon_i = -1$ means a sample is totally dropped from the objective function; all perturbations are assumed to lie within this interval). Therefore, if we let $\pi_i = 1 + \epsilon_i$, the perturbation is transformed to a sampling probability, because Eq. (13) and Eq. (14) are in the same form.
In fact, the resulting objective has a closed form of optimal $\pi$:

$\pi_i^{*} = \mathbb{1}\,[\mathcal{I}_i \le 0]$ (15)

This form of sampling is termed Data dropout in [25], while here we call it Deterministic sampling, because it simply sets a threshold and selects samples deterministically. By contrast, sampling with a continuous function $\pi_i = \pi(\mathcal{I}_i)$ is called Probabilistic sampling, since each sample has a probability of being selected. Most sampling studies belong to the probabilistic category: [24] builds sampling probabilities based on A-optimality and L-optimality respectively, and [23] uses influence-based probabilities.
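The claim that thresholding minimizes the linearized objective can be checked directly: over $\pi_i \in [0, 1]$, the minimizer of $\sum_i \pi_i\,\mathcal{I}_i$ keeps exactly the samples with non-positive influence. The influence values below are made up for illustration:

```python
import numpy as np

def deterministic_pi(infl):
    """Data-dropout / deterministic sampling: keep a sample iff its influence <= 0."""
    return (infl <= 0).astype(float)

rng = np.random.default_rng(0)
infl = rng.normal(size=1000)        # hypothetical influence values I_i
pi_star = deterministic_pi(infl)
obj_star = pi_star @ infl           # minimized linearized test-risk change

# no other feasible pi in [0, 1]^n achieves a smaller objective
for _ in range(200):
    pi = rng.random(1000)
    assert pi @ infl >= obj_star
```

The hard 0/1 decision is exactly what makes this rule non-Lipschitz near zero influence, which the next subsection argues costs distributional robustness.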
Analysis of sampling functions
For the Dropout method, a sample's influence over a specific $Q$ is the essential criterion for sampling, so the obtained subset model ends up optimal only for that $Q$. However, the subset model's robustness to distribution shift [10] is also a concern. That is, for a set of distributions around the empirical one, can our subset model still maintain its performance? In this work, we postulate that influence-based subsampling must trade the subset model's performance on a specific $Q$ off against its distributional robustness. From this viewpoint, the Dropout method is overly confident in a single $Q$ at the expense of deteriorating generalization ability, so it is reasonable to measure and control this confidence degree for our subsampling methods.
Consider an uncertainty set around the training distribution:

$\mathcal{Q} = \{\,Q : Q \ll P,\ D(Q\,\|\,P) \le \rho\,\}$ (16)

where $Q \ll P$ denotes that $Q$ is absolutely continuous w.r.t. $P$, and $D(\cdot\|\cdot)$ denotes a divergence (e.g. the $\chi^2$-divergence). The set $\mathcal{Q}$ is a divergence ball of radius $\rho$, containing all neighborhoods of the empirical distribution $P$. The worst-case risk is defined as the supremum of the risk over any $Q \in \mathcal{Q}$ [3]:

$R_{wc}(\theta) = \sup_{Q \in \mathcal{Q}} \mathbb{E}_{\mathbf{z}\sim Q}\big[\ell(\mathbf{z}, \theta)\big]$ (17)

From [5], the dual form of Eq. (17) (for the $\chi^2$-divergence ball) is:

$R_{wc}(\theta) = \inf_{\eta \in \mathbb{R}} \Big\{ c_\rho\,\big(\mathbb{E}_{P}\big[(\ell(\mathbf{z}, \theta) - \eta)_+^{2}\big]\big)^{1/2} + \eta \Big\}$ (18)

where $\eta$ is the dual variable and $c_\rho$ is a constant depending only on the radius $\rho$. This duality transforms the supremum in Eq. (17) into a convex function of the empirical distribution $P$, thus allowing us to measure the worst-case risk quantitatively. Before analyzing how the worst-case risk changes with different sampling functions in Theorem 3, we need to introduce two terms:
Definition 1.
A function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be Lipschitz continuous with constant $L$ if $|f(x_1) - f(x_2)| \le L\,\|x_1 - x_2\|$ for all $x_1, x_2$.
Definition 2.
A function $f$ has bounded gradients if $\|\nabla f(x)\| \le B$ for all $x$.
Theorem 3.
Let $\eta^*$ be the optimal dual variable achieving the infimum in Eq. (18), and let the perturbation (sampling) function have bounded gradients with bound $B$. Then the worst-case risk is Lipschitz continuous w.r.t. the IF vector, with a Lipschitz constant proportional to $B$.

Theorem 3 relates the change rate of the worst-case risk to the gradient bound of the perturbation function. For the Dropout method, its sampling function Eq. (15) has an unbounded gradient, since it is discontinuous at the zero point. This property makes the worst-case risk no longer Lipschitz continuous, so it can fluctuate sharply. By contrast, our probabilistic methods can adjust the confidence degree by tuning the gradient bound. This is crucial to avoid overconfidence in a specific $Q$ that leads to large risk on other distributions. In fact, our experiments bear out that our probabilistic methods maintain their performance out-of-sample with a proper gradient bound, while the Dropout method often crashes. The proof of Theorem 3 can be found in Appendix C.
Surrogate metric for confidence degree
Although we find that the gradient bound $B$ is the determinant of the confidence degree, it is still intractable to measure this degree quantitatively, which is important for guiding our methods' use in practice. Empirically, to deal with overfitting, practitioners prefer adding constraints on the model's parameters $\theta$, e.g. a norm regularizer. In our theory, we propose to apply the parameter shift $\|\tilde\theta - \hat\theta\|$ to evaluate the confidence degree over a specific $Q$. We term this a surrogate metric for the confidence degree, and prove in Theorem 4 that it is reasonable because the parameter shift has the same magnitude of Lipschitz constant as the worst-case risk. In detail, the worst-case risk and our surrogate metric share the same change rate with respect to the sampling function's gradient bound $B$.
Theorem 4.
Let the perturbation function have bounded gradients, and let the loss gradient norm $\|\nabla_\theta \ell\|$ be bounded. Then the parameter shift $\|\tilde\theta - \hat\theta\|$ is Lipschitz continuous w.r.t. the IF vector, with a Lipschitz constant of the same order as that of the worst-case risk in Theorem 3.
Implementation
In this section, the unweighted subsampling method is incorporated into our framework, shown in Fig. 2: we train $\hat\theta$ on the full data set, and calculate the IF vector on the training set with respect to the validation loss. The sampling probabilities are then acquired with the designed probabilistic sampling function. We will discuss the two basic modules of this framework: 1) calculating the IF and 2) designing probabilistic sampling functions.
Calculating influence functions
The IF in Eq. (7) can be calculated in two steps: first compute the inverse Hessian-vector product (HVP) $s = H_{\hat\theta}^{-1} v$ with $v = \nabla_\theta \ell(z'_j, \hat\theta)$, then multiply it with $\nabla_\theta \ell(z_i, \hat\theta)$ for each training sample. To handle sparse scenarios where $x$ has high dimension, [16] proposes transforming the inverse HVP into an optimization problem, $s = \arg\min_t \frac{1}{2} t^{\top} H t - v^{\top} t$, and solving it with the Newton conjugate gradient (Newton-CG) method. Moreover, [1] proves that stochastic estimation makes the calculation feasible when the loss function is non-convex in $\theta$. These works ensure our framework's feasibility in both convex and non-convex scenarios. Without loss of generality, we mainly focus on convex scenarios.
When CG converges slowly because of an ill-conditioned subproblem, a mixed preconditioner is useful to reduce the number of CG steps [17, 8]:

$M = \alpha\,\mathrm{diag}(H) + (1 - \alpha)\,I$ (19)

where $\alpha \in [0, 1]$ is a weight parameter, $I$ is the identity matrix, and $H$ is the Hessian matrix. Specifically for the logistic regression model, the diagonal elements of $H$ are:

$H_{kk} = \frac{1}{n}\sum_{i=1}^{n} x_{ik}^{2}\, h_\theta(x_i)\big(1 - h_\theta(x_i)\big) + \lambda$ (20)

where $\lambda$ is the regularization parameter. Our experiments demonstrate that the mixed PCG is efficacious for speeding up the calculation of the IF.
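A minimal Hessian-free sketch of this solver for logistic regression, using SciPy's conjugate gradient with the mixed diagonal preconditioner of Eq. (19); the function name and default values are illustrative, not the paper's code:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inverse_hvp_pcg(X, theta, v, lam=1e-2, alpha=0.5):
    """Solve H s = v without ever forming H, where H is the regularized
    logistic-regression Hessian; preconditioner M = alpha*diag(H) + (1-alpha)*I."""
    n, d = X.shape
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    w = p * (1.0 - p)                                # per-sample curvature
    def hvp(t):                                      # H t, Hessian-free
        return X.T @ (w * (X @ t)) / n + lam * t
    H_op = LinearOperator((d, d), matvec=hvp)
    diag_H = (X ** 2 * w[:, None]).sum(axis=0) / n + lam   # Eq. (20)
    m = alpha * diag_H + (1.0 - alpha)               # Eq. (19), diagonal preconditioner
    M_op = LinearOperator((d, d), matvec=lambda t: t / m)
    s, info = cg(H_op, v, M=M_op, maxiter=1000)
    return s, info
```

Only matrix-vector products with `X` are needed, so the same code works when `X` is a `scipy.sparse` matrix, which is the sparse high-dimensional setting the paper targets.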
Probabilistic sampling functions
From Lemma 2, a better subset model is ensured by a decreasing sampling function $\pi(\mathcal{I})$. Furthermore, Theorems 3 and 4 prove that the gradient bound of $\pi(\cdot)$ can be adjusted to control the confidence degree over a specific $Q$. We can therefore design a family of probabilistic sampling functions with a tunable hyperparameter. Here we develop two basic functions, termed Linear sampling and Sigmoid sampling.
Linear sampling.
Inspired by [23], which builds sampling probabilities proportional to the influence magnitude, we design a Linear sampling function where $\pi_i$ decreases linearly in $\mathcal{I}_i$:

$\pi_i = \max\big(0, \min(1,\ c - a\,\mathcal{I}_i)\big)$ (21)

where $a > 0$ is a tunable coefficient and $c$ is an offset chosen to meet the target sampling ratio. It is easy to prove that the gradient bound of this function is $a$, so the degree of confidence relies on $a$ for Linear sampling. It is a little different from [23], because $\mathcal{I}_i$ can be both negative and positive, which means many samples can have zero probability of being sampled. If we set a relatively high sampling ratio, we may never get enough samples in our subset. Empirically, we find that randomly picking up the remaining zero-probability samples reaches relatively good results.
Sigmoid sampling.
The Sigmoid function is generally used in logistic regression to scale outputs into
$(0, 1)$, indicating the probability of each class, so here we can use it to transform $\mathcal{I}_i$ into a probability as follows:

$\pi_i = \frac{1}{1 + \exp(a\,\mathcal{I}_i)}$ (22)

where $a > 0$ is a tunable coefficient. For the Sigmoid function, we can still adjust
$a$ to make the probability distribution flatter or steeper, thereby controlling the confidence degree.
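Both families can be sketched in a few lines; the hyperparameter names `a` and `c` are ours, and the clipping in the linear variant is one reasonable way to keep probabilities valid:

```python
import numpy as np

def linear_pi(infl, a=1.0, c=0.5):
    """Linear sampling (sketch of Eq. 21): decreasing in influence,
    gradient bound a, clipped to valid probabilities."""
    return np.clip(c - a * infl, 0.0, 1.0)

def sigmoid_pi(infl, a=1.0):
    """Sigmoid sampling (sketch of Eq. 22): smooth and strictly decreasing;
    larger a makes the curve steeper, i.e. a more confident sampler."""
    return 1.0 / (1.0 + np.exp(a * infl))

def draw_subset(pi, rng):
    """Bernoulli draws o_i ~ Bern(pi_i); returns indices of the chosen subset."""
    return np.flatnonzero(rng.random(len(pi)) < pi)
```

Tuning `a` trades confidence for robustness: as `a` grows, `sigmoid_pi` approaches the deterministic 0/1 rule of Eq. (15); as `a` shrinks toward 0, it approaches uniform random sampling with probability 1/2.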
Experiments
In this section, we first present the data sets and experiment settings, then introduce several baselines for comparison. After that, we evaluate our methods in terms of effectiveness, robustness, and efficiency.
Data sets
We perform extensive experiments on various public data sets which cover many domains, including computer vision, natural language processing, click-through rate prediction, etc. Additionally, we test the methods on the
Company data set, which contains around 100 million samples with over 10 million features. They are queries collected from a real-world recommender system, whose feature set contains user history behaviour, the item's side information, and contextual information such as time and location. These data sets range from small to large and from low to high dimensionality, which can testify to the methods' effectiveness and robustness in diverse scenarios. The data set statistics and more details about preprocessing are described in appendix E.

Data set | Full set | Random | OptLR | Dropout | Lin-UIDS* | Sig-UIDS*

UCI breast-cancer | 0.0914 | 0.0944 | 0.0934 | 0.0785 | 0.0873 | 0.0803
diabetes | 0.5170 | 0.5180 | 0.5232 | 0.5083 | 0.5127 | 0.5068
News20 | 0.5130 | 0.5177 | 0.5203 | 0.5072 | 0.5100 | 0.5075
UCI Adult | 0.3383 | 0.3386 | 0.3549 | 0.3538 | 0.3384 | 0.3382
cifar10 | 0.6847 | 0.6861 | 0.7246 | 0.6851 | 0.6822 | 0.6819
MNIST | 0.0245 | 0.0247 | 0.0239 | 0.0223 | 0.0245 | 0.0231
real-sim | 0.2606 | 0.2668 | 0.2644 | 0.2605 | 0.2607 | 0.2609
SVHN | 0.6129 | 0.6128 | 0.6757 | 0.6328 | 0.6122 | 0.6128
skin-nonskin | 0.3527 | 0.3526 | 0.3529 | 0.4830 | 0.3713 | 0.3527
Criteo 1% | 0.4763 | 0.4768 | 0.4953 | 0.4786 | 0.4755 | 0.4756
Covertype | 0.6936 | 0.6933 | 0.6907 | 0.7745 | 0.6872 | 0.6876
Avazu-app | 0.3449 | 0.3449 | 0.3450 | 0.3576 | 0.3446 | 0.3446
Avazu-site | 0.4499 | 0.4499 | 0.4505 | 0.5736 | 0.4490 | 0.4486
Company | 0.1955 | 0.1956 | 0.1958 | 0.1964 | 0.1952 | 0.1953

* UIDS is the abbreviation of our Unweighted Influence Data Subsampling method. Lin and Sig indicate the incorporated Linear and Sigmoid sampling functions, respectively.
Considered baselines
We select the model trained on the full data set and three other sampling approaches as baselines for comparison: Optimal sampling [23], Data dropout [25], and Random sampling.

Optimal sampling. A recent weighted subsampling method whose sampling probabilities are proportional to the magnitude of the sample's influence. This method aims at best approaching the full-set performance, so theoretically it cannot overtake the full set.

Data dropout. An unweighted subsampling approach which adopts a simple sampling strategy: dropping out unfavorable samples whose influence on the validation loss is positive.

Random sampling. Simple random selection on Tr, meaning all samples share the same probability. Theoretically, this strategy cannot win over the full set either.
Experiment protocols
In our experiments, we use a Tr-Va-Te setting, which is different from the Tr-Va setting of much previous work (see Fig. 4). Both settings proceed in three steps and share the same first two: 1) train the model $\hat\theta$ on the full Tr set, predict on Va, then compute the IF; 2) derive sampling probabilities from the IF, sample Tr to get the subset, then train the subset model $\tilde\theta$. In the third step, we introduce an additional out-of-sample test set (Te) to test $\tilde\theta$ (step (b.3)) rather than testing $\tilde\theta$ on Va (step (a.3)). The reason is that if we use $\hat\theta$'s validation loss on Va to guide the subsampling and then train the subset model $\tilde\theta$, the testing result of $\tilde\theta$ on Va cannot convince us of its generalization ability.
In fact, our framework is applicable to both convex and non-convex models, and we mainly focus on subsampling theory in this work. For implementation simplicity, we use logistic regression in all experiments. Besides, to ensure that our methods indeed achieve good performance in terms of the metric they optimize for, we use the logistic loss for computing the influence function, and correspondingly the log-loss as the evaluation metric in all experiments. More details about the experimental settings can be found in appendix F.
Experiment observations
Result 1: Effectiveness.
The experimental results are shown in Table 2, where the sampling ratio is set to the same value for all sampling methods. The average test log-loss values (over 10 repeated samplings from the same Tr set) are listed in each column. Bold entries indicate a log-loss less than the full-set model's, and underlined ones are the best across the row. It can be seen that our Sig- and Lin-UIDS outperform the full-set model on most data sets, while Dropout often fails. Besides, due to the high variance incurred by the weight terms, the OptLR method suffers severely. In a nutshell, Sig-UIDS performs best on 5 of 14 data sets, and Lin-UIDS and Dropout each achieve 4. This means overconfidence is sometimes beneficial on homogeneous data sets, e.g. MNIST, but Dropout fails on all relatively large-scale and heterogeneous data sets. The probabilistic sampling methods have universal superiority over the others, since they stay robust on a set of distributions rather than on a specific one (the Va set).
Our unweighted method can downweight the bad cases which cause high test loss for our model, which is an important reason for its ability to improve results with less data. To show the performance of our methods with noisy labels, we perform additional experiments with some training labels flipped. The results in Fig. 5 show the enlarged superiority of our subsampling methods.
Result 2: Robustness.
In Fig. 6, we can see that the Dropout method performs very well on the Va set; however, it fails in the out-of-sample test. To illustrate how the proposed surrogate metric for the confidence degree works, we vary the sampling ratio from large to small and observe how the surrogate metric changes. As shown in Fig. 7, Dropout causes a large parameter shift, while our Sig-UIDS has as small a shift as Random sampling. This phenomenon coincides with our Theorem 4 on the Lipschitz constant. With a proper coefficient in Sig-UIDS, the majority of sampling probabilities stay moderate, which makes the sampling process smoother.
Result 3: Efficiency.
Table 3 shows a summary of the running times. For most data sets, our method can calculate the IF within one minute. On large and sparse data sets, it completes the computation within ten minutes, which is acceptable in practice.
Conclusion & Future Work
In this work, we theoretically study unweighted subsampling with the IF, then propose a novel unweighted subsampling framework and design a family of probabilistic sampling methods. The experiments show that 1) unlike previous weighted methods, our unweighted method can acquire a subset model that indeed wins over the full-set model on a given test set; and 2) it is crucial to evaluate the confidence degree over the empirical distribution to enhance the subset model's generalization ability.
Although our framework of Unweighted Influence Data Subsampling (UIDS) succeeds in improving model accuracy, some interesting ideas remain to be explored. Since our framework is applicable to both convex and non-convex models, we can further verify its performance on non-convex models, e.g. deep neural networks. Another direction is to develop better approaches to deal with the overfitting issue, e.g. building a validation set selection scheme. Besides, we plan to deploy our method in industry in the future.
Acknowledgement
The research of ShaoLun Huang was funded by the Natural Science Foundation of China 61807021, Shenzhen Science and Technology Research and Development Funds (JCYJ20170818094022586), and Innovation and entrepreneurship project for overseas highlevel talents of Shenzhen (KQJSCX20180327144037831). The authors would like to thank Professor ChihJen Lin’s insight and advice for theory and writing of this work.
References
[1] (2017) Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research 18(1), pp. 4148-4187.
[2] (2018) Optimal subsampling algorithms for big data generalized linear models. arXiv preprint arXiv:1806.06761.
[3] (2005) Robust supervised learning. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI 2005), Pittsburgh, Pennsylvania, USA, pp. 714-719.
[4] (1980) Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22(4), pp. 495-508.
[5] (2018) Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
[6] (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Annals of Statistics 42(5), p. 1693.
[7] (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), pp. 119-139.
[8] (2018) Preconditioned conjugate gradient methods in truncated Newton frameworks for large-scale linear classification. In Asian Conference on Machine Learning, pp. 312-326.
[9] (2016) Does distributionally robust supervised learning give robust classifiers? In ICML.
[10] (2018) Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, pp. 2034-2042.
[11] (2011) Robust Statistics. Springer.
[12] (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1885-1894.
[13] (2010) Self-paced learning for latent variable models. In 24th Annual Conference on Neural Information Processing Systems (NIPS 2010), pp. 1189-1197.
[14] (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, pp. 2999-3007.
[15] (2011) Ensemble of exemplar-SVMs for object detection and beyond. In IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, pp. 89-96.
[16] (2010) Deep learning via Hessian-free optimization. In ICML, Vol. 27, pp. 735-742.
[17] (1985) Preconditioning of truncated-Newton methods. SIAM Journal on Scientific and Statistical Computing 6(3), pp. 599-616.
[18] (2018) Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.
[19] (2016) Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352.
[20] (2018) Finding influential training samples for gradient boosted decision trees. arXiv preprint arXiv:1802.06640.
[21] (2019) Axiomatic characterization of data-driven influence measures for classification. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 718-725.
[22] (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada.
[23] (2018) Optimal subsampling with influence functions. In Advances in Neural Information Processing Systems, pp. 3650-3659.
[24] (2018) Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522), pp. 829-844.
[25] (2018) Data dropout: optimizing training data for convolutional neural networks. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 39-46.
[26] (2019) Theoretically principled trade-off between robustness and accuracy. In ICML.
[27] (2018) Training set debugging using trusted items. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 4482-4489.
Appendix A. Proof of Lemma 1
Lemma 1.
The expectation of the influence function over the training distribution is always 0, which means:

$\mathbb{E}_{\mathbf{z}\sim P}\big[\boldsymbol{\mathcal{I}}\big] = 0$ (1)

Proof.
Taking the expectation over the empirical training distribution, $\mathbb{E}[\boldsymbol{\mathcal{I}}]$ is simply the average over the full training set. Based on the IF's definition, we have

$\mathbb{E}\big[\boldsymbol{\mathcal{I}}\big] = \frac{1}{n}\sum_{i=1}^{n} \mathcal{I}_i = -\frac{1}{m}\sum_{j=1}^{m} \nabla_\theta \ell(z'_j, \hat\theta)^{\top} H_{\hat\theta}^{-1}\Big(\frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \ell(z_i, \hat\theta)\Big) = 0$ (2)

because $\frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \ell(z_i, \hat\theta) = 0$ at the risk minimizer $\hat\theta$ in this scenario. ∎
Appendix B. Proof of Lemma 2
Lemma 2.
The subset model performs no worse than the full-set model in terms of test risk if $\boldsymbol{\epsilon}$ and $\boldsymbol{\mathcal{I}}$ are negatively correlated:

$\mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) \le 0 \ \Longrightarrow\ R_{te}(\hat\theta_\epsilon) \le R_{te}(\hat\theta)$ (3)

Proof.
Decomposing the expectation, we get $\mathbb{E}[\boldsymbol{\epsilon}\,\boldsymbol{\mathcal{I}}] = \mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) + \mathbb{E}[\boldsymbol{\epsilon}]\,\mathbb{E}[\boldsymbol{\mathcal{I}}]$. Based on Lemma 1, $\mathbb{E}[\boldsymbol{\mathcal{I}}] = 0$, such that $\mathbb{E}[\boldsymbol{\epsilon}\,\boldsymbol{\mathcal{I}}] = \mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) \le 0$, which by the linearization in Eq. (10) means the subset model's test risk is less than or equal to the full-set model's. ∎
Appendix C. Proof of Theorem 3
Theorem 3.
Let $\eta^*$ be the optimal dual variable that achieves the infimum in Eq. (4), and let the perturbation function have bounded gradients. Then the worst-case risk is a Lipschitz continuous function w.r.t. the IF vector, with a Lipschitz constant proportional to the gradient bound.
Proof.
In order to measure a model's performance on an uncertainty set $\mathcal{Q}$, it is common to define the worst-case risk as $R_{wc}(\theta) = \sup_{Q \in \mathcal{Q}} \mathbb{E}_{\mathbf{z}\sim Q}[\ell(\mathbf{z}, \theta)]$. Its dual form is given as:

$R_{wc}(\theta) = \inf_{\eta \in \mathbb{R}} \Big\{ c_\rho\,\big(\mathbb{E}_{P}\big[(\ell(\mathbf{z}, \theta) - \eta)_+^{2}\big]\big)^{1/2} + \eta \Big\}$ (4)

whose gradient w.r.t. the IF vector is:
(5) 
where $\eta^*$ helps Eq. (4) reach the infimum. Without loss of generality, take one element of the gradient vector and analyze its bound:
(6)  
(7)  
(8)  
(9)  
(10) 
Hence we can bound the norm of the gradient as:
(11)  
(12) 
That means the change rate of the worst-case risk is aligned with the gradient bound of the perturbation function. ∎
Appendix D. Proof of Theorem 4
Theorem 4.
Let the perturbation function have bounded gradients, and let the loss gradient norm be bounded. Then the parameter shift $\|\tilde\theta - \hat\theta\|$ is Lipschitz continuous w.r.t. the IF vector, with a Lipschitz constant of the same order as that in Theorem 3.
Proof.
Note that the parameter shift is a function of the IF vector; its gradient w.r.t. this vector is also a vector of the same dimension:
(13) 
In fact, proving that the parameter shift is Lipschitz continuous is equivalent to proving that its gradient is bounded. Let us select one arbitrary element of the vector and try to derive its bound:
(14)  
(15)  
(16)  
(17) 
The first approximation, Eq. (15), comes from the definition of the influence function on parameters. The first inequality, Eq. (16), holds since the perturbation function has bounded gradients. The second inequality, Eq. (17), comes from the Cauchy-Schwarz inequality.
Note that the loss gradient is bounded, so the corresponding term must be bounded as well. Here we can make the approximation that higher-order terms are negligible if each $\epsilon_i$ is small, such that
(18)  
(19)  
(20) 
The second inequality, Eq. (19), holds because the gradient norm is bounded. Combining Eq. (17) and Eq. (20), we can derive that each element is bounded, such that the gradient norm is bounded:
(21)  
(22)  
(23) 
Therefore, we can conclude from Eq. (23) that the parameter shift is Lipschitz continuous, and it is easy to derive the corresponding Lipschitz constant. ∎
Appendix E. Data Sets and Experimental Settings
Data set
The data sets statistics can be found in Table 1, and several of them are processed specifically.
MNIST, cifar10 and SVHN.
They are all 10-class image classification data sets, while logistic regression can only handle binary classification. On MNIST and SVHN, we select the digits 1 and 7 as the positive and negative classes; on cifar10, we classify cat vs. dog. Each image is converted to a flattened feature vector with all pixel values scaled to a common range.
Covertype.
It is a multi-class forest cover type classification data set, which is transformed to binary classification, and all features are scaled.
News20.
This is a size-balanced two-class variant of the UCI 20 Newsgroups data set, where each class contains 10 of the original newsgroups, and each example vector is normalized to unit length.
Criteo1%.
It was used in a CTR prediction competition held jointly by Kaggle and Criteo in 2014. The data used here has undergone feature engineering according to the winning solution of that competition. We randomly sample 1% of the examples from the original data set (hence the name Criteo 1%).
Avazuapp and Avazusite.
This data was used in a CTR prediction competition held jointly by Kaggle and Avazu in 2014. Here the data is generated according to the winning solution, where it is split into two groups, "app" and "site", for better performance.
Experimental settings
For logistic regression on both the full set and the subset, we use the same regularization term for fair comparison. For the Optimal sampling method, we scale the probabilities into a valid range and clip them to prevent the weights from having large variance, following the original work. For the Data dropout method, we rank the samples by their IF and select the top ones. For the Linear sampling function, we set the coefficients similarly to Optimal sampling, and we randomly pick up additional samples if the samples with positive probability are not enough to reach the target sampling ratio. For Sigmoid sampling, we tune the steepness coefficient.
For the public data, we randomly pick data from Tr as the Va set for each data set. For the Company data, with domain knowledge we use 7 days of data as Tr, 1 day for Va, and 1 day for Te. For all subsampling methods, Tr, Va, and Te are kept the same for fair comparison. Besides, to make the test log-loss comparable among different subsampling methods, the positive-negative sample ratio is kept invariant after subsampling for all methods, which avoids the test log-loss being influenced by a shift of the label ratio.