Less Is Better: Unweighted Data Subsampling via Influence Function

12/03/2019 · by Zifeng Wang et al. · HUAWEI Technologies Co., Ltd. · Tsinghua University

In the era of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume by subsampling in order to save computation resources. Most previous works on subsampling are weighted methods designed to help the performance of the subset-model approach that of the full-set-model; hence weighted methods have no chance of acquiring a subset-model that is better than the full-set-model. In this work, we instead ask: how can we achieve a better model with less data? We propose a novel Unweighted Influence Data Subsampling (UIDS) method and prove that the subset-model acquired through our method can outperform the full-set-model. Besides, we show that over-confidence on a given test set used for sampling is common in influence-based subsampling methods, which can eventually cause the subset-model to fail in out-of-sample tests. To mitigate this, we develop a probabilistic sampling scheme to control the worst-case risk over all distributions close to the empirical distribution. Experimental results demonstrate our methods' superiority over existing subsampling methods on diverse tasks, such as text classification, image classification, and click-through prediction.



Introduction

Bigger data can usually train a better model; this is almost common sense nowadays for machine learning and deep learning practitioners. Classical Empirical Risk Minimization (ERM) theory assumes that training samples and test samples are drawn i.i.d. from the same distribution, and learning minimizes an estimate of the generalization risk, namely the empirical risk. Therefore, when the training set is large enough, we can obtain a hypothesis function that works well on the test set by optimizing the empirical risk.

However, ERM faces several challenges: 1) a model is learned from a training set generated by $P_{tr}$ but is tested on $P_{te}$; the distribution shift from $P_{tr}$ to $P_{te}$ violates ERM's basic assumption; 2) unknown noise in the data and its labels is common in reality, making some examples harmful to the model's performance [22, 26]; 3) training on large data sets imposes a significant computational burden; some large-scale deep learning models require hundreds or even thousands of GPUs.

The subsampling approach was initially proposed to cope with the last challenge: by virtue of a carefully designed sampling scheme, the selected subset best approximates the original full set in terms of data distribution, so the model can be trained on a compressed version of the data set. In our work, we attempt to design sampling schemes that not only reduce computational complexity but also address the other difficulties of ERM, for instance, reweighting examples by sampling probabilities to fix the mismatch between $P_{tr}$ and $P_{te}$, and dropping noisy samples to strengthen the model's generalization ability.

Our work can be outlined in four points. First, instead of approaching the full-set-model $\hat\theta$, we prove that the model trained on a subset selected by our subsampling method can outperform $\hat\theta$. Second, we propose several probabilistic sampling functions and analyze how the sampling function influences the worst-case risk [3] over an $f$-divergence ball; we further propose a surrogate metric to measure the confidence degree of the sampling methods over the observed distribution, which is useful for evaluating the model's generalization ability on a set of distributions. Third, for implementation efficiency, a Hessian-free mixed Preconditioned Conjugate Gradient (PCG) method is used to compute the influence function (IF) in sparse scenarios. Last, comprehensive experiments are conducted on diverse tasks to demonstrate our methods' superiority over existing state-of-the-art subsampling methods. The code can be found at https://github.com/RyanWangZf/Influence_Subsampling.

Related work

There are two main lines of work coping with the aforementioned ERM challenges: 1) pessimistic methods that try to learn models robust to noise or bad examples, including $\ell_2$-norm regularization, AdaBoost [7], hard example mining [15], and focal loss [14]; and 2) optimistic methods that modify the input distribution directly. There are several genres of optimistic methods: example reweighting is used for dealing with distribution shift by [3, 9] and for handling data bias by [13, 18]; sample selection is applied to inspect and fix mislabeled data by [27]. However, few of them address the computational burden of big data.

In order to reduce computation, weighted subsampling has been explored to approximate the maximum likelihood estimator with a subset for logistic regression [6, 24] and for generalized linear models [2]. [23] introduces the IF into weighted subsampling to derive asymptotically optimal sampling probabilities for several generalized linear models. However, how to handle the high variance of the weight terms in weighted subsampling is still an open problem.

Specifically, the IF is defined via Gateaux derivatives within the scope of Robust Statistics [11], and has been extended to measure example-wise influence [12] and feature-wise influence [21] on validation loss. The IF family was previously applied mainly to design adversarial examples and to explain the behaviour of black-box models. Recently, the IF on validation loss has been used for targeting important samples: [25] builds a sample selection scheme for deep Convolutional Neural Networks (CNNs), and [20] builds an influential-sample selection algorithm for Gradient Boosted Decision Trees (GBDT). However, so far there is no systematic theory to guide the use of the IF in subsampling. Our work builds theoretical guidance for IF-based subsampling, combining reweighting and subsampling to synthetically cope with ERM's challenges, e.g. distribution shift and noisy data.

Preliminaries

Training samples $\{z_i = (x_i, y_i)\}_{i=1}^{n}$ are generated from $P_{tr}$, with $x_i \in \mathbb{R}^d$ where $d$ is the feature dimension. Specifically, for a classification task we have a hypothesis function $h_\theta(x)$ parameterized by $\theta$. The goal is to minimize the 0-1 risk and learn the optimal $\theta^*$. For computational tractability, researchers focus on minimizing a surrogate loss, e.g. the log loss for binary classification ($y \in \{0, 1\}$):

$\ell(\theta; z_i) = -y_i \log h_\theta(x_i) - (1 - y_i)\log\big(1 - h_\theta(x_i)\big). \qquad (1)$

Therefore, the risk minimization problem can be empirically approximated by $\hat\theta = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n}\ell(\theta; z_i)$. For convenience, we denote $\ell(\theta; z_i)$ by $\ell_i(\theta)$. The main notations are listed in Table 1.

$z_i$, $z'_j$ — training and testing sample.
$\hat\theta$ (or $\tilde\theta$) — full-set-model and subset-model.
$\phi_i$ — $z_i$'s influence on the whole test set risk.
$\mathcal{I}_\theta(z_i)$ — $z_i$'s influence on the model parameters.
$\epsilon_i$ — perturbation put on $z_i$'s loss term.
$\pi_i$ — sampling probability of $z_i$.
$\ell_i(\theta)$ — model $\theta$'s loss on training sample $z_i$.
$\ell'_j(\theta)$ — model $\theta$'s loss on test sample $z'_j$.
$P_{tr}$ — training distribution.
$P_{te}$ — a specific test distribution.
Table 1: Main notation.

Weighted subsampling.

For a general subsampling framework, each sample $z_i$ is assigned a random variable $O_i \in \{0, 1\}$ indicating whether the sample is selected, such that $\pi_i = \Pr[O_i = 1] = \Pr[z_i \text{ is selected}]$. Weighted subsampling methods share a similar form of objective function on the subset:

$\tilde\theta_{w} = \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} \frac{O_i}{\pi_i}\,\ell_i(\theta), \qquad (2)$

where each loss term is weighted by the inverse of its sampling probability. This is similar to the inverse-propensity technique used in causal inference to handle selection bias [19]. Eq. (3) derives the expectation of the weighted objective over $O$:

$\mathbb{E}_{O}\Big[\frac{1}{n}\sum_{i=1}^{n}\frac{O_i}{\pi_i}\,\ell_i(\theta)\Big] = \frac{1}{n}\sum_{i=1}^{n}\ell_i(\theta). \qquad (3)$

The expectation of the weighted risk on a subset equals the empirical risk on the full set, which means weighted subsampling methods aim at finding optimal $\pi_i$ that bring the subset risk minimizer $\tilde\theta_w$ as close to the full-set risk minimizer $\hat\theta$ as possible.

This weighted approach has three main challenges: 1) we need to modify existing training procedures to accommodate the weighted loss function; 2) because $\pi_i$ can be small and its inverse ranges widely, the weighted loss function suffers from high variance; and 3) most importantly, since weighted methods build a consistent estimator of the full-set-model [2], they theoretically assume that the subset-model cannot outperform the full-set-model.

Unweighted subsampling.

We propose a novel unweighted subsampling method whose objective function does not involve the weights $1/\pi_i$:

$\tilde\theta = \arg\min_\theta \frac{1}{|\{i: O_i = 1\}|}\sum_{i:\,O_i = 1} \ell_i(\theta), \qquad (4)$

where $|\cdot|$ denotes the cardinality of a set. From Eq. (4), the expectation of the subset-model's risk is no longer equal to the full-set-model's. The formula can be seen as implicitly reweighting the samples with respect to their sampling probabilities. It directly overcomes the first two challenges of weighted subsampling mentioned above, but further effort is required to solve the last one, which is introduced in the following section. An intuitive illustration of the difference between the weighted and unweighted methods is shown in Fig. 1.

Figure 1: (a) if the blue points (training samples) within the red circle are removed, the new optimal decision boundary stays the same as before; (b) if the blue points in the red circle are removed, the new decision boundary shifts away from the former one while achieving better performance on the Te set.

Influence functions.

If the $i$-th training sample's loss term is upweighted by a small $\epsilon_i$, then the perturbed risk minimizer is

$\hat\theta_{\epsilon_i} = \arg\min_\theta \frac{1}{n}\sum_{j=1}^{n}\ell_j(\theta) + \epsilon_i\,\ell_i(\theta). \qquad (5)$

The basic idea of the IF is to approximate the change of the parameters [4]:

$\mathcal{I}_\theta(z_i) := \frac{d\hat\theta_{\epsilon_i}}{d\epsilon_i}\Big|_{\epsilon_i = 0} = -H_{\hat\theta}^{-1}\,\nabla_\theta \ell_i(\hat\theta), \qquad (6)$

or the change of the loss on a given test sample $z'_j$ drawn from $P_{te}$ [12]:

$\mathcal{I}(z_i, z'_j) := \frac{d\,\ell'_j(\hat\theta_{\epsilon_i})}{d\epsilon_i}\Big|_{\epsilon_i = 0} = -\nabla_\theta \ell'_j(\hat\theta)^{\top} H_{\hat\theta}^{-1}\,\nabla_\theta \ell_i(\hat\theta), \qquad (7)$

where $z_i$ and $z'_j$ come from the training and test set respectively, and $H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n}\nabla^2_\theta \ell_i(\hat\theta)$ is the Hessian matrix at the full-set risk minimizer $\hat\theta$, which is positive definite (PD) if the empirical risk is twice-differentiable and strictly convex in $\theta$.

Methodology

The key challenge lies in the fact that $\hat\theta$ may not be the best risk minimizer for $P_{te}$, due to the distribution shift between $P_{tr}$ and $P_{te}$ and unknown noisy samples in the training set. With the advent of the IF, we can measure one sample's influence on the test distribution without prohibitive leave-one-out retraining. The essential philosophy here is: given a test distribution $P_{te}$, some samples in the training set increase the test risk; if they are downweighted, we can obtain a lower test risk than before, namely $R_{te}(\tilde\theta) \le R_{te}(\hat\theta)$, where $\tilde\theta$ is the new model learned after the harmful samples are downweighted.

Subset can be better

Consider perturbations put on each training sample's loss term, $\boldsymbol{\epsilon} = (\epsilon_1, \dots, \epsilon_n)$, with the perturbed risk minimizer denoted $\hat\theta_{\boldsymbol{\epsilon}}$. Given $m$ samples $\{z'_j\}_{j=1}^{m}$ from another distribution $P_{te}$, the objective is to design the $\boldsymbol{\epsilon}$ that minimizes the test risk $R_{te}(\hat\theta_{\boldsymbol{\epsilon}})$.

According to the definition of the IF in Eq. (7), we can approximate the loss change on $z'_j$ if $z_i$ is upweighted by $\epsilon_i$:

$\ell'_j(\hat\theta_{\epsilon_i}) - \ell'_j(\hat\theta) \approx \epsilon_i\,\mathcal{I}(z_i, z'_j), \qquad (8)$

which can be extended to the whole test distribution as follows:

$R_{te}(\hat\theta_{\epsilon_i}) - R_{te}(\hat\theta) \approx \frac{\epsilon_i}{m}\sum_{j=1}^{m}\mathcal{I}(z_i, z'_j). \qquad (9)$

For convenience, we use $\phi_i = \frac{1}{m}\sum_{j=1}^{m}\mathcal{I}(z_i, z'_j)$ to denote training sample $z_i$'s influence over the whole $P_{te}$. Therefore, under the perturbations $\boldsymbol{\epsilon}$, the test risk change can be approximated as follows (we assume all elements of $\boldsymbol{\epsilon}$ are small and that each training sample influences the test risk independently, hence we add up all terms linearly for simplicity of implementation):

$R_{te}(\hat\theta_{\boldsymbol{\epsilon}}) - R_{te}(\hat\theta) \approx \sum_{i=1}^{n}\epsilon_i\,\phi_i. \qquad (10)$

Specifically, when $\epsilon_i = 0$ for all $i$, the subset is the same as the full set; in this situation $\hat\theta_{\boldsymbol{\epsilon}} = \hat\theta$, so the test risk change in Eq. (10) is zero. Based on this analysis, we have Lemma 1. For clarity of notation, we use bold letters to represent random variables, so that $\epsilon_i$ and $\phi_i$ are realizations of the random variables $\boldsymbol{\epsilon}$ and $\boldsymbol{\phi}$, respectively.

Lemma 1.

The expectation of the influence function over the training distribution is always 0:

$\mathbb{E}_{z \sim P_{tr}}[\boldsymbol{\phi}] = 0. \qquad (11)$

According to Eq. (10), minimizing the test risk is equivalent to minimizing the objective $\sum_i \epsilon_i \phi_i$. In fact, this objective is the empirical form of $\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\phi}]$, from which we derive Lemma 2.

Lemma 2.

The subset-model performs no worse than the full-set-model in terms of test risk if $\boldsymbol{\epsilon}$ and $\boldsymbol{\phi}$ are negatively correlated:

$\mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\phi}) \le 0 \;\Rightarrow\; \mathbb{E}\big[R_{te}(\hat\theta_{\boldsymbol{\epsilon}})\big] \le R_{te}(\hat\theta). \qquad (12)$

Lemma 2 instructs that making the perturbation negatively correlated with the IF ensures a better subset-model $\tilde\theta$. To this end, we can let $\epsilon(\phi)$ be a decreasing function of $\phi$, as checked numerically below. The proofs of Lemma 1 and Lemma 2 are given in supplementary appendices A and B, respectively.

Deterministic vs. probabilistic sampling

Similar to Eq. (3), the expectation of the subset risk via the observation variables $O_i$ can be written as:

$\mathbb{E}_{O}\Big[\frac{1}{n}\sum_{i=1}^{n} O_i\,\ell_i(\theta)\Big] = \frac{1}{n}\sum_{i=1}^{n}\pi_i\,\ell_i(\theta). \qquad (13)$

However, the objective in Eq. (10) is defined on the perturbations $\epsilon_i$ rather than the sampling probabilities $\pi_i$, so we need to bridge the gap between them; the perturbed empirical risk can be rewritten as:

$\frac{1}{n}\sum_{i=1}^{n}\ell_i(\theta) + \sum_{i=1}^{n}\epsilon_i\,\ell_i(\theta) = \frac{1}{n}\sum_{i=1}^{n}\big(1 + n\,\epsilon_i\big)\,\ell_i(\theta). \qquad (14)$

Eq. (14) holds because, from Lemma 1, we know $\mathbb{E}[\boldsymbol{\phi}] = 0$. Here we assume $\epsilon_i \in [-\frac{1}{n}, 0]$ ($\epsilon_i = 0$ means no perturbation is applied, while $\epsilon_i = -\frac{1}{n}$ means a sample is completely dropped from the objective; all perturbations are assumed to lie within this interval). Therefore, if we let $\pi_i = 1 + n\,\epsilon_i$, the perturbation is transformed into a sampling probability, because Eq. (13) and Eq. (14) then take the same form.

In fact, this formulation admits a closed-form optimal $\pi_i$:

$\pi_i^{*} = \mathbb{1}\{\phi_i \le 0\}. \qquad (15)$

This form of sampling is termed Data Dropout in [25]; here we call it deterministic sampling, because it simply sets a threshold and selects samples deterministically. By contrast, sampling with a continuous function $\pi(\phi)$ is called probabilistic sampling, since each sample has a probability of being selected. Most sampling studies belong to the probabilistic family: [24] builds sampling probabilities based on A-optimality and L-optimality, respectively, and [23] uses probabilities proportional to the influence norm.

Analysis of sampling functions

For the Dropout method, the sample's influence over a specific $P_{te}$ is the essential sampling criterion, so the obtained subset-model ends up being optimal only for that $P_{te}$. However, the subset-model's robustness to distribution shift [10] is also a concern: for a set of distributions around the empirical one, can our subset-model still maintain its performance? In this work, we postulate that influence-based subsampling trades the subset-model's performance on a specific $P_{te}$ off against its distributional robustness. From this viewpoint, the Dropout method is overly confident about one $P_{te}$ at the expense of deteriorated generalization ability; hence it is reasonable to measure and control this confidence degree in our subsampling methods.

Consider an uncertainty set $\mathcal{P} = \{Q : Q \ll P,\ D_f(Q \,\|\, P) \le \rho\}$, where $Q \ll P$ denotes that $Q$ is absolutely continuous w.r.t. $P$, and $D_f$ is the $f$-divergence:

$D_f(Q\,\|\,P) = \int f\Big(\frac{dQ}{dP}\Big)\,dP. \qquad (16)$

$\mathcal{P}$ is an $f$-divergence ball containing all neighborhoods of the empirical distribution $P$. The worst-case risk is defined as the supremum of the risk over all $Q \in \mathcal{P}$ [3]:

$R_{wc}(\theta) = \sup_{Q \in \mathcal{P}} \mathbb{E}_{z \sim Q}\big[\ell(\theta; z)\big]. \qquad (17)$

Following [5], the dual form of Eq. (17) is:

$R_{wc}(\theta) = \inf_{\eta \in \mathbb{R}} \Big\{ c_\rho\, \mathbb{E}_{z \sim P}\big[(\ell(\theta; z) - \eta)_+^2\big]^{1/2} + \eta \Big\}, \quad c_\rho = \sqrt{2\rho + 1}, \qquad (18)$

where $\eta$ is the dual variable. This duality transforms the supremum in Eq. (17) into a convex function of the empirical distribution $P$, allowing us to measure the worst-case risk quantitatively. Before analyzing in Theorem 3 how the worst-case risk changes with different sampling functions, we introduce two definitions:

Definition 1.

A function $f(x): \mathbb{R}^n \to \mathbb{R}$ is said to be Lipschitz continuous with constant $L$ if $|f(x_1) - f(x_2)| \le L\,\|x_1 - x_2\|_2$ for all $x_1, x_2$.

Definition 2.

A function $f(x)$ has $\sigma$-bounded gradients if $\|\nabla f(x)\|_2 \le \sigma$ for all $x$.

Theorem 3.

Let $\eta^*$ be the optimal dual variable that achieves the infimum in Eq. (18), and let the perturbation function $\epsilon(\phi)$ have $\sigma$-bounded gradients. Then the worst-case risk $R_{wc}$ is a Lipschitz continuous function w.r.t. the IF vector $\boldsymbol{\phi} = (\phi_1, \dots, \phi_n)$, with Lipschitz constant $L \propto \sigma$; that is, $|R_{wc}(\boldsymbol{\phi}_1) - R_{wc}(\boldsymbol{\phi}_2)| \le L\,\|\boldsymbol{\phi}_1 - \boldsymbol{\phi}_2\|_2$.

Theorem 3 relates the rate of change of the worst-case risk to the gradient bound of the perturbation function $\epsilon(\phi)$. For the Dropout method, the sampling function in Eq. (15) has an unbounded gradient since it is discontinuous at the zero point, causing $\sigma \to \infty$. This property makes $R_{wc}$ no longer Lipschitz continuous, so it can fluctuate sharply. By contrast, our probabilistic methods can adjust the confidence degree by tuning $\sigma$. This is crucial to avoid over-confidence on a specific $P_{te}$ that leads to large risk on other distributions in $\mathcal{P}$. In fact, our experiments bear out that our probabilistic methods maintain their performance out-of-sample with a proper $\sigma$, while the Dropout method often crashes. The proof of Theorem 3 can be found in Appendix C.

Surrogate metric for confidence degree

Although we have established that $\sigma$ is the determinant of the confidence degree, it is still intractable to measure this degree quantitatively, which is important to guide our methods' use in practice. Empirically, to deal with overfitting, practitioners prefer adding constraints on the model's parameters $\theta$, e.g. an $\ell_2$-norm regularizer. In the same spirit, we propose to use the parameter shift $\|\tilde\theta - \hat\theta\|_2$ to evaluate the confidence degree over a specific $P_{te}$. We term $\|\tilde\theta - \hat\theta\|_2$ a surrogate metric for the confidence degree, and prove in Theorem 4 that this is reasonable because it has the same magnitude of Lipschitz constant as $R_{wc}$: the worst-case risk and our surrogate metric share the same rate of change with respect to the sampling function's gradient bound $\sigma$.

Theorem 4.

Let the perturbation function $\epsilon(\phi)$ have $\sigma$-bounded gradients, and let the per-sample gradient norm be bounded by $C$, that is, $\|\nabla_\theta \ell_i(\hat\theta)\|_2 \le C$ for all $i$. Then the parameter shift $\|\tilde\theta - \hat\theta\|_2$ is Lipschitz continuous w.r.t. the IF vector, with Lipschitz constant $L' \propto \sigma$. Specifically, for an unbounded $\sigma$ (i.e. the Dropout method), $L'$ is unbounded as well.

Theorem 4 is helpful for measuring our sampling methods' confidence degree in practice when the radius $\rho$ of the $f$-divergence ball is unknown. Theoretically, a relatively small parameter shift guarantees a more robust model. The proof of Theorem 4 can be found in supplementary appendix D.

Figure 2: Our unweighted subsampling framework.

Implementation

In this section, the unweighted subsampling method is incorporated into our framework, shown in Fig. 2: we train $\hat\theta$ on the full data set and calculate the IF vector $\boldsymbol{\phi}$ over the training set using the validation set. The sampling probabilities are then obtained from the designed probabilistic sampling function. We discuss the two basic modules of this framework: 1) calculating the IF and 2) designing probabilistic sampling functions.

Calculating influence functions

The IF in Eq. (7) can be calculated in two steps: first compute the inverse Hessian-vector product (HVP) $H_{\hat\theta}^{-1}\,\nabla_\theta \ell'(\hat\theta)$, then multiply it with $\nabla_\theta \ell_i(\hat\theta)$ for each training sample. To handle sparse scenarios where the features have high dimension, [16] proposes to transform the inverse HVP into the optimization problem $\min_{v}\ \frac{1}{2} v^\top H_{\hat\theta}\, v - v^\top \nabla_\theta \ell'(\hat\theta)$ and solve it with the Newton conjugate gradient (Newton-CG) method. Moreover, [1] proves that stochastic estimation makes the calculation feasible when the loss function is non-convex in $\theta$. These works ensure our framework's feasibility in both convex and non-convex scenarios; without loss of generality, we mainly focus on convex scenarios.

When CG converges slowly because of an ill-conditioned sub-problem, a mixed preconditioner is useful for reducing the number of CG steps [17, 8]:

$M = \alpha\,\mathrm{diag}(H_{\hat\theta}) + (1 - \alpha)\,I, \qquad (19)$

where $\alpha$ is a weight parameter, $I$ is the identity matrix, and $H_{\hat\theta}$ is the Hessian matrix. Specifically, for the logistic regression model, the diagonal elements of $H_{\hat\theta}$ are:

$\big(H_{\hat\theta}\big)_{kk} = \frac{1}{n}\sum_{i=1}^{n} h_{\hat\theta}(x_i)\big(1 - h_{\hat\theta}(x_i)\big)\,x_{ik}^2 + \lambda, \qquad (20)$

where $\lambda$ is the regularization parameter. Our experiments demonstrate that mixed PCG is effective for speeding up the calculation of the IF.

Figure 3: A family of sampling functions; the $x$-axis is the value of the influence function $\phi$ and the $y$-axis is the probability $\pi(\phi)$.

Probabilistic sampling functions

From Lemma 2, a better subset-model is ensured by a decreasing function $\pi(\phi)$. Furthermore, Theorems 3 and 4 show that the gradient bound of $\pi(\phi)$ can be tuned to adjust the confidence degree over a given $P_{te}$. We can therefore design a family of probabilistic sampling functions with a tunable hyperparameter. Here we develop two basic functions, termed Linear sampling and Sigmoid sampling.

Linear sampling.

Inspired by [23], which builds sampling probabilities proportional to the influence norm, we design a Linear sampling function:

$\pi(\phi_i) = \min\{\max\{-a\,\phi_i,\ 0\},\ 1\}, \qquad (21)$

where $a > 0$ is a scale hyperparameter. It is easy to show that the gradient bound of this function is $\sigma = a$; thus the degree of confidence relies on $a$ for Linear sampling. It differs slightly from [23] because $\phi_i$ can be both negative and positive, which means many samples can have zero probability of being sampled. If we set a relatively high sampling ratio, we may never collect enough samples for the subset; empirically, we find that randomly picking additional samples among those clipped to zero probability reaches relatively good results.

Sigmoid sampling.

The Sigmoid function is commonly used in logistic regression to scale outputs into $(0, 1)$, indicating the probability of each class; here we use it to transform $\phi_i$ into a probability as follows:

$\pi(\phi_i) = \frac{1}{1 + \exp(a\,\phi_i)}, \qquad (22)$

where $a > 0$. For the Sigmoid function, we can still adjust $a$ to make the probability distribution flatter or steeper, thereby controlling the confidence degree.

Experiments

In this section, we first present the data sets and experimental settings and introduce several baselines for comparison. We then evaluate our methods in terms of effectiveness, robustness, and efficiency.

Data sets

We perform extensive experiments on various public data sets covering many domains, including computer vision, natural language processing, and click-through rate (CTR) prediction. Additionally, we test the methods on the Company data set, which contains around 100 million samples with over 10 million features. These are queries collected from a real-world recommender system, whose feature set contains user history behaviour, items' side information, and contextual information such as time and location. The data sets range from small to large and from low to high dimension, which testifies to the methods' effectiveness and robustness in diverse scenarios. The data set statistics and more details about preprocessing are described in appendix E.

Dataset | Full set | Random | OptLR | Dropout | Lin-UIDS* | Sig-UIDS*
UCI breast-cancer | 0.0914 | 0.0944 | 0.0934 | 0.0785 | 0.0873 | 0.0803
diabetes | 0.5170 | 0.5180 | 0.5232 | 0.5083 | 0.5127 | 0.5068
News20 | 0.5130 | 0.5177 | 0.5203 | 0.5072 | 0.5100 | 0.5075
UCI Adult | 0.3383 | 0.3386 | 0.3549 | 0.3538 | 0.3384 | 0.3382
cifar10 | 0.6847 | 0.6861 | 0.7246 | 0.6851 | 0.6822 | 0.6819
MNIST | 0.0245 | 0.0247 | 0.0239 | 0.0223 | 0.0245 | 0.0231
real-sim | 0.2606 | 0.2668 | 0.2644 | 0.2605 | 0.2607 | 0.2609
SVHN | 0.6129 | 0.6128 | 0.6757 | 0.6328 | 0.6122 | 0.6128
skin-nonskin | 0.3527 | 0.3526 | 0.3529 | 0.4830 | 0.3713 | 0.3527
Criteo1% | 0.4763 | 0.4768 | 0.4953 | 0.4786 | 0.4755 | 0.4756
Covertype | 0.6936 | 0.6933 | 0.6907 | 0.7745 | 0.6872 | 0.6876
Avazu-app | 0.3449 | 0.3449 | 0.3450 | 0.3576 | 0.3446 | 0.3446
Avazu-site | 0.4499 | 0.4499 | 0.4505 | 0.5736 | 0.4490 | 0.4486
Company | 0.1955 | 0.1956 | 0.1958 | 0.1964 | 0.1952 | 0.1953
  • UIDS is the abbreviation of our Unweighted Influence Data Subsampling method. Lin- and Sig- indicate the incorporated Linear and Sigmoid sampling functions, respectively.

Table 2: Average logloss evaluated on the out-of-sample Te set at a fixed sampling ratio.
Figure 4: Tr-Va vs. Tr-Va-Te setting.

Considered baselines

We select the model trained on the full data set and three sampling approaches as baselines: Optimal sampling [23], Data dropout [25], and Random sampling.

  1. Optimal sampling. A recent weighted subsampling method that uses sampling probabilities proportional to the norm of the influence function. It aims at best approaching the full-set performance, so theoretically it cannot overtake the full set.

  2. Data dropout. An unweighted subsampling approach that adopts a simple strategy: dropping out the unfavorable samples whose influence $\phi_i > 0$.

  3. Random sampling. Simple random selection on Tr, in which all samples share the same probability. Theoretically, this strategy cannot win over the full set either.

Experiment protocols

In our experiments, we use a Tr-Va-Te setting, which differs from the Tr-Va setting used in much previous work (see Fig. 4). Both settings proceed in three steps and share the first two: 1) train the model on the full Tr, predict on Va, then compute the IF; 2) obtain sampling probabilities from the IF, sample Tr to get the subset, then train the subset-model $\tilde\theta$. In the third step, we introduce an additional out-of-sample test set (Te) to evaluate $\tilde\theta$ (step (b.3)), rather than evaluating $\tilde\theta$ on Va (step (a.3)). The reason is that if we use the validation loss on Va to guide the subsampling and then train the subset-model $\tilde\theta$, testing $\tilde\theta$ on the same Va cannot convince us of its generalization ability.

(a) Breast cancer
(b) MNIST
Figure 5: Test accuracy with noisy labels (40% of labels flipped).

In fact, our framework is applicable to both convex and non-convex models, but we mainly focus on subsampling theory in this work. For simplicity of implementation, we use logistic regression in all experiments. Besides, to ensure that our methods achieve good performance in terms of the metric they optimize for, we use the logistic loss both for computing the influence function and as the evaluation metric (logloss) in all experiments. More details about the experimental settings can be found in appendix F.

(a) Validation logloss
(b) Test logloss
Figure 6: Comparison of logloss between the Va set and the Te set.
Figure 7: The confidence degree $\|\tilde\theta - \hat\theta\|_2$.

Experiment observations

Result 1: Effectiveness.

The experimental results are shown in Table 2, where the sampling ratio is fixed for all sampling methods; each cell is the average test logloss over 10 repeated samplings from the same Tr set. Bold numbers indicate logloss lower than the full-set-model's, and underlined ones are the best in each row. Our Sig-UIDS and Lin-UIDS beat the full-set-model on most data sets, while Dropout often fails. Besides, due to the high variance incurred by the weight terms, the OptLR method suffers severely. In a nutshell, Sig-UIDS performs best on 5 of the 14 data sets, while Lin-UIDS and Dropout each take 4. This means over-confidence is sometimes beneficial on homogeneous data sets, e.g. MNIST, but Dropout fails on all relatively large-scale and heterogeneous data sets. The probabilistic sampling methods show universal superiority, since they keep robustness over a set of distributions $\mathcal{P}$ rather than a specific $P_{te}$ (the Va set).

Our unweighted method can downweight bad cases that cause high test loss, which is an important reason for its ability to improve results with less data. To show the performance of our methods under noisy labels, we perform additional experiments with some training labels flipped. The results in Fig. 5 show the growing superiority of our subsampling methods.

Dataset | # samples | # features | Time cost (sec)
diabetes (a) | 768 | 8 | 0.03
news20 (a) | 19,954 | 1,355,192 | 2.77
cifar10 (a) | 60,000 | 3,072 | 5.85
criteo1% (a) | 456,674 | 1,000,000 | 38.85
Avazu-app (b) | 14,596,137 | 1,000,000 | 166.56
Avazu-site (b) | 25,832,830 | 1,000,000 | 310.66

Table 3: Time costs of computing the influence function on the whole training set.
  • (a) Run on an Intel i7-6600U CPU @ 2.60GHz.
  • (b) Run on an Intel Xeon E5-2670 v3 CPU @ 2.30GHz.

Result 2: Robustness.

In Fig. 6, we can see that the Dropout method performs very well on the Va set but fails in the out-of-sample test. To illustrate how the proposed surrogate metric for the confidence degree works, we vary the sampling ratio from large to small and observe how the surrogate metric changes. As seen in Fig. 7, Dropout causes a large parameter shift, while our Sig-UIDS has as small a shift as Random sampling. This phenomenon coincides with Theorem 4, whereby the Lipschitz constant grows with the gradient bound $\sigma$. With a proper scale parameter $a$ in Sig-UIDS, the majority of the sampling probabilities concentrate around a moderate value, which makes the sampling process smoother.

Result 3: Efficiency.

Table 3 summarizes the running time. For most data sets, our method calculates the IF within one minute; on large and sparse data sets, the computation finishes within ten minutes, which is acceptable in practice.

Conclusion & Future Work

In this work, we theoretically study unweighted subsampling with the IF, propose a novel unweighted subsampling framework, and design a family of probabilistic sampling methods. The experiments show that 1) unlike previous weighted methods, our unweighted method can acquire a subset-model that indeed beats the full-set-model on a given test set; and 2) evaluating the confidence degree over the empirical distribution is crucial for enhancing the subset-model's generalization ability.

Although our Unweighted Influence Data Subsampling (UIDS) framework succeeds in improving model accuracy, some interesting ideas remain to be explored. Since our framework is applicable to both convex and non-convex models, we can further test its performance on non-convex models, e.g. deep neural networks. Another direction is to develop better approaches to the overfitting issue, e.g. building a validation set selection scheme. Besides, we plan to deploy our method in industry in the future.

Acknowledgement

The research of Shao-Lun Huang was funded by the Natural Science Foundation of China (61807021), the Shenzhen Science and Technology Research and Development Fund (JCYJ20170818094022586), and the Innovation and Entrepreneurship Project for Overseas High-Level Talents of Shenzhen (KQJSCX20180327144037831). The authors would like to thank Professor Chih-Jen Lin for his insight and advice on the theory and writing of this work.

References

  • [1] N. Agarwal, B. Bullins, and E. Hazan (2017) Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research 18 (1), pp. 4148–4187. Cited by: Calculating influence functions.
  • [2] M. Ai, J. Yu, H. Zhang, and H. Wang (2018) Optimal subsampling algorithms for big data generalized linear models. arXiv preprint arXiv:1806.06761. Cited by: Related work, Weighted subsampling..
  • [3] J. A. Bagnell (2005) Robust supervised learning. In Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, July 9-13, 2005, Pittsburgh, Pennsylvania, USA, pp. 714–719. Cited by: Related work, Introduction, Analysis of sampling functions.
  • [4] R. D. Cook and S. Weisberg (1980) Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22 (4), pp. 495–508. Cited by: Influence functions..
  • [5] J. C. Duchi and H. Namkoong (2018) Learning models with uniform performance via distributionally robust optimization. ArXiv abs/1810.08750. Cited by: Analysis of sampling functions.
  • [6] W. Fithian and T. Hastie (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Annals of statistics 42 (5), pp. 1693. Cited by: Related work.
  • [7] Y. Freund and R. E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), pp. 119–139. Cited by: Related work.
  • [8] C. Hsia, W. Chiang, and C. Lin (2018) Preconditioned conjugate gradient methods in truncated newton frameworks for large-scale linear classification. In Asian Conference on Machine Learning, pp. 312–326. Cited by: Calculating influence functions.
  • [9] W. Hu, G. Niu, I. Sato, and M. Sugiyama (2016) Does distributionally robust supervised learning give robust classifiers?. In ICML. Cited by: Related work.
  • [10] W. Hu, G. Niu, I. Sato, and M. Sugiyama (2018) Does distributionally robust supervised learning give robust classifiers?. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2034–2042. Cited by: Analysis of sampling functions.
  • [11] P. J. Huber (2011) Robust statistics. Springer. Cited by: Related work.
  • [12] P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1885–1894. Cited by: Related work, Influence functions..
  • [13] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In 24th Annual Conference on Neural Information Processing Systems 2010 (NIPS), pp. 1189–1197. Cited by: Related work.
  • [14] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 2999–3007. Cited by: Related work.
  • [15] T. Malisiewicz, A. Gupta, and A. A. Efros (2011) Ensemble of exemplar-svms for object detection and beyond. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pp. 89–96. Cited by: Related work.
  • [16] J. Martens (2010) Deep learning via hessian-free optimization.. In ICML, Vol. 27, pp. 735–742. Cited by: Calculating influence functions.
  • [17] S. G. Nash (1985) Preconditioning of truncated-newton methods. SIAM Journal on Scientific and Statistical Computing 6 (3), pp. 599–616. Cited by: Calculating influence functions.
  • [18] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050. Cited by: Related work.
  • [19] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, and T. Joachims (2016) Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352. Cited by: Weighted subsampling..
  • [20] B. Sharchilev, Y. Ustinovsky, P. Serdyukov, and M. de Rijke (2018) Finding influential training samples for gradient boosted decision trees. arXiv preprint arXiv:1802.06640. Cited by: Related work.
  • [21] J. Sliwinski, M. Strobel, and Y. Zick (2019) Axiomatic characterization of data-driven influence measures for classification. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 718–725. Cited by: Related work.
  • [22] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, External Links: Link Cited by: Introduction.
  • [23] D. Ting and E. Brochu (2018) Optimal subsampling with influence functions. In Advances in Neural Information Processing Systems, pp. 3650–3659. Cited by: Related work, Deterministic v.s. Probabilistic sampling, Linear sampling., Considered baselines.
  • [24] H. Wang, R. Zhu, and P. Ma (2018) Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113 (522), pp. 829–844. Cited by: Related work, Deterministic v.s. Probabilistic sampling.
  • [25] T. Wang, J. Huan, and B. Li (2018) Data dropout: optimizing training data for convolutional neural networks. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 39–46. Cited by: Related work, Deterministic vs. probabilistic sampling, Considered baselines.
  • [26] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In ICML, Cited by: Introduction.
  • [27] X. Zhang, X. Zhu, and S. J. Wright (2018) Training set debugging using trusted items. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), pp. 4482–4489. External Links: Link Cited by: Related work.

Appendix A. Proof of Lemma 1

Lemma 1.

The expectation of the influence function over the training distribution is always 0:

$\mathbb{E}_{z \sim P_{tr}}[\boldsymbol{\phi}] = 0. \qquad (1)$

Proof.

If each training sample is drawn uniformly from the full training set, the expectation of $\boldsymbol{\phi}$ is simply the average over the full training set. Based on $\phi_i$'s definition, we have

$\mathbb{E}[\boldsymbol{\phi}] = \frac{1}{n}\sum_{i=1}^{n}\phi_i = -\frac{1}{m}\sum_{j=1}^{m}\nabla_\theta \ell'_j(\hat\theta)^{\top} H_{\hat\theta}^{-1}\Big(\frac{1}{n}\sum_{i=1}^{n}\nabla_\theta \ell_i(\hat\theta)\Big) = 0, \qquad (2)$

because $\frac{1}{n}\sum_{i=1}^{n}\nabla_\theta \ell_i(\hat\theta) = 0$ at the risk minimizer $\hat\theta$ in this scenario. ∎

Appendix B. Proof of Lemma 2

Lemma 2.

The subset-model performs no worse than the full-set-model in terms of test risk if $\boldsymbol{\epsilon}$ and $\boldsymbol{\phi}$ are negatively correlated:

$\mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\phi}) \le 0 \;\Rightarrow\; \mathbb{E}\big[R_{te}(\hat\theta_{\boldsymbol{\epsilon}})\big] \le R_{te}(\hat\theta). \qquad (3)$

Proof.

Decomposing the expectation, we get $\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\phi}] = \mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\phi}) + \mathbb{E}[\boldsymbol{\epsilon}]\,\mathbb{E}[\boldsymbol{\phi}]$. Based on Lemma 1, $\mathbb{E}[\boldsymbol{\phi}] = 0$, so $\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\phi}] = \mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\phi}) \le 0$, which by Eq. (10) means the subset-model's expected test risk is less than or equal to the full-set-model's. ∎

Appendix C. Proof of Theorem 3

Theorem 3.

Let $\eta^*$ be the optimal dual variable that achieves the infimum in Eq. (4), and let the perturbation function $\epsilon(\phi)$ have $\sigma$-bounded gradients. Then the worst-case risk $R_{wc}$ is a Lipschitz continuous function w.r.t. the IF vector $\boldsymbol{\phi}$, with Lipschitz constant $L \propto \sigma$; that is, $|R_{wc}(\boldsymbol{\phi}_1) - R_{wc}(\boldsymbol{\phi}_2)| \le L\,\|\boldsymbol{\phi}_1 - \boldsymbol{\phi}_2\|_2$.

Proof.

In order to measure the model's performance on the uncertainty set $\mathcal{P}$, it is common to define the worst-case risk as $R_{wc}(\theta) = \sup_{Q \in \mathcal{P}} \mathbb{E}_{z \sim Q}[\ell(\theta; z)]$, whose dual form is given as:

$R_{wc}(\theta) = \inf_{\eta \in \mathbb{R}} \Big\{ c_\rho\, \mathbb{E}_{z \sim P}\big[(\ell(\theta; z) - \eta)_+^2\big]^{1/2} + \eta \Big\}, \quad c_\rho = \sqrt{2\rho + 1}. \qquad (4)$

Its gradient with respect to the IF vector $\boldsymbol{\phi}$ is an $n$-dimensional vector:

(5)

where $\eta^*$ helps Eq. (4) reach the infimum. Without loss of generality, take one element of this vector and analyze its bound:

(6)–(10)

Hence we can bound the norm of the gradient as:

(11)–(12)

That means the rate of change of the worst-case risk is aligned with the gradient bound $\sigma$. ∎

Appendix D. Proof of Theorem 4

Theorem 4.

Let the perturbation function $\epsilon(\phi)$ have $\sigma$-bounded gradients, and let the per-sample gradient norm be bounded by $C$, that is, $\|\nabla_\theta \ell_i(\hat\theta)\|_2 \le C$ for all $i$. Then the parameter shift $\|\tilde\theta - \hat\theta\|_2$ is Lipschitz continuous w.r.t. the IF vector, with Lipschitz constant $L' \propto \sigma$. Specifically, for an unbounded $\sigma$ (i.e. the Dropout method), $L'$ is unbounded as well.

Proof.

Note that the parameter shift is $\|\tilde\theta - \hat\theta\|_2$; its gradient with respect to the IF vector $\boldsymbol{\phi}$ is also an $n$-dimensional vector:

(13)

In fact, proving that the parameter shift is Lipschitz continuous is equivalent to proving that this gradient is bounded. Select one arbitrary element of the vector and derive its bound:

(14)–(17)

The first approximation, Eq. (15), comes from the definition of the influence function on the parameters, since $\hat\theta_{\boldsymbol{\epsilon}} \to \hat\theta$ when $\boldsymbol{\epsilon} \to 0$. The first inequality, Eq. (16), holds since $\epsilon(\phi)$ has $\sigma$-bounded gradients. The second inequality, Eq. (17), comes from the Cauchy-Schwarz inequality.

Since $\|\nabla_\theta \ell_i(\hat\theta)\|_2$ is bounded, the remaining factor must be bounded as well. Here we can make an approximation when each $\epsilon_i$ is small, such that

(18)–(20)

The second inequality, Eq. (19), holds because $\|\nabla_\theta \ell_i(\hat\theta)\|_2$ is bounded by $C$. Combining Eq. (17) and Eq. (20), we derive that each element is bounded, so the gradient norm is bounded:

(21)–(23)

Therefore, we conclude from Eq. (23) that the Lipschitz constant satisfies $L' \propto \sigma$; specifically, for an unbounded $\sigma$ (i.e. the Dropout method), $L'$ is unbounded as well. ∎

Appendix E. Data Sets and Experimental Settings

Dataset | # samples | # features | Domain
UCI breast-cancer | 683 | 10 | Medical
diabetes | 768 | 8 | Medical
news20 | 19,954 | 1,355,192 | Text
UCI Adult | 32,561 | 123 | Society
cifar10 | 60,000 | 3,072 | Image
MNIST | 70,000 | 784 | Image
real-sim | 72,309 | 20,958 | Physics
SVHN | 99,289 | 3,072 | Image
skin non-skin | 245,057 | 3 | Image
criteo1% | 456,674 | 1,000,000 | CTR
Covtype | 581,012 | 54 | Life
Avazu-app | 14,596,137 | 1,000,000 | CTR
Avazu-site | 25,832,830 | 1,000,000 | CTR
Company | ~100M | ~10M | CTR

Table 1: Data set statistics.

Data set

The data set statistics can be found in Table 1; several data sets are preprocessed as follows.

MNIST, cifar10 and SVHN.

These are all 10-class image classification data sets, while logistic regression can only handle binary classification. On MNIST and SVHN, we select the digits 1 and 7 as the positive and negative classes; on cifar10, we classify cat versus dog. For each image, we flatten all pixels into feature values and scale them to $[0, 1]$.

Covertype.

It is a multi-class forest cover type classification data set, which we transform into a binary classification task with all features scaled to $[0, 1]$.

News20.

This is a size-balanced two-class variant of the UCI 20 Newsgroups data set, where each of the two classes contains 10 of the original newsgroups and each example vector is normalized to unit length.

Criteo1%.

It comes from a CTR prediction competition held jointly by Kaggle and Criteo in 2014. The data used here has undergone feature engineering according to the winning solution of that competition. We randomly sample 1% of the examples from the original data set.

Avazu-app and Avazu-site.

This data comes from a CTR prediction competition held jointly by Kaggle and Avazu in 2014. The data is generated according to the winning solution, where it is split into two groups, "app" and "site", for better performance.

Experimental settings

For logistic regression on both the full set and the subset, we use the same regularization term for a fair comparison. For the Optimal sampling method, we scale the probabilities into a valid range and apply a lower threshold to prevent the weights $1/\pi_i$ from having large variance. For the Data dropout method, we rank the samples by their IF and select the top ones. For the Linear sampling function, we set the scale parameter similarly to Optimal sampling, and we randomly pick among the unfavorable samples if the samples with positive probability are not enough to reach the target sampling ratio. For Sigmoid sampling, we set the scale parameter $a$ analogously.

For public data, we randomly pick a portion of Tr as the Va set for each data set. For the Company data, using domain knowledge we take 7 days of data as Tr, 1 day as Va, and 1 day as Te. For all subsampling methods, Tr, Va, and Te remain the same for a fair comparison. Besides, to make the test logloss comparable among different subsampling methods, the positive-negative sample ratio is kept invariant after subsampling for all methods, which avoids the test logloss being influenced by a shift of the label ratio.