Introduction
Bigger data can usually train a better model; this is close to common sense for machine learning and deep learning practitioners nowadays. Classical
Empirical Risk Minimization (ERM) theory assumes that training and test samples are drawn i.i.d. from the same distribution. Learning then minimizes an estimate of the generalization risk, namely the empirical risk. Therefore, when the training set is large enough, optimizing the empirical risk yields a hypothesis function that works well on the test set.
However, ERM faces several challenges: 1) a model is learned from a training set generated by a distribution $P$, but the model is tested under a distribution $Q$; the distribution shift from $P$ to $Q$ violates ERM's basic assumption; 2) unknown noise in data and labels is common in reality, making some examples harmful to the model's performance [22, 26]; 3) training on large data sets imposes a significant computational burden; some large-scale deep learning models require hundreds or even thousands of GPUs.
Specifically, subsampling approaches were initially proposed to cope with the last challenge. By virtue of a subtle sampling regime, the selected subset can best approximate the original full set in terms of data distribution, so the model can be trained on a compressed version of the data set. In our work, we attempt to design sampling regimes that not only reduce the computational complexity, but also deal with several of ERM's other difficulties, for instance, reweighting examples by sampling probabilities to fix the mismatch between $P$ and $Q$, and dropping noisy samples to strengthen the model's generalization ability.
Our work can be outlined in four points. First, instead of merely approaching the full-set model $\hat{\theta}$, we prove that a model trained on a subset selected by our subsampling method can outperform $\hat{\theta}$. Second, we propose several probabilistic sampling functions and analyze how the sampling function influences the worst-case risk [3] over a divergence ball; we further propose a surrogate metric to measure the confidence degree of the sampling methods over the observed distribution, which is useful for evaluating the model's generalization ability on a set of distributions. Third, for the sake of implementation efficiency, a Hessian-free mixed Preconditioned Conjugate Gradient (PCG) method is used to compute the influence function (IF) in sparse scenarios. Last, comprehensive experiments are conducted on diverse tasks to demonstrate our method's superiority over existing state-of-the-art subsampling methods. (The code can be found at https://github.com/RyanWangZf/Influence_Subsampling.)
Related work
There are two main ideas for coping with the aforementioned ERM challenges: 1) pessimistic methods that try to learn models robust to noise or bad examples, including norm regularization, AdaBoost [7], hard example mining [15], and focal loss [14]; and 2) optimistic methods that modify the input distribution directly. There are several genres of optimistic methods: example reweighting is used for dealing with distribution shift by [3, 9] and for handling data bias by [13, 18]; sample selection is applied to inspect and fix mislabeled data by [27]. However, few of them address the computational burden of big data.
In order to reduce computation, weighted subsampling methods have been explored to approximate the maximum likelihood with a subset, for logistic regression
[6, 24] and for generalized linear models [2]. [23] introduces the IF into weighted subsampling to obtain asymptotically optimal sampling probabilities for several generalized linear models. However, how to treat the high variance of the weight terms in weighted subsampling remains an open problem.
Specifically, the IF is defined via Gateaux derivatives within the scope of robust statistics [11], and has been extended to measure example-wise influence [12] and feature-wise influence [21] on validation loss. The family of IFs was previously applied mainly to design adversarial examples and to explain the behaviour of black-box models. Recently, the IF on validation loss has been used for targeting important samples: [25] builds a sample selection scheme for deep convolutional networks (CNNs), and [20]
builds a specific influential-sample selection algorithm for Gradient Boosted Decision Trees (GBDT). However, so far there is no systematic theory to guide the IF's use in subsampling. Our work tries to build theoretical guidance for IF-based subsampling, which combines reweighting and subsampling to synthetically cope with ERM's challenges, e.g. distribution shift and noisy data.
Preliminaries
Training samples $z_i = (x_i, y_i)$, $i = 1, \dots, n$, are generated from a distribution $P$, where $x_i \in \mathbb{R}^d$ and $d$ is the number of feature dimensions. Specifically for the classification task, we have a hypothesis function $h_\theta$ parameterized by $\theta$. The goal is to minimize the 0-1 risk and learn the optimal $\theta$. For computational tractability, researchers focus on minimizing a surrogate loss, e.g. the log loss for binary classification:

$\ell(z, \theta) = -y \log h_\theta(x) - (1-y)\log\big(1 - h_\theta(x)\big)$ (1)

Therefore, the risk minimization problem can be empirically approximated by the empirical risk $\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(z_i, \theta)$, whose minimizer we denote by $\hat\theta$. The main notations are listed in Table 1.
$z_i$, $z'_j$ : training and testing sample, $i = 1,\dots,n$, $j = 1,\dots,m$.
$\hat\theta$, $\tilde\theta$ : full-set model and subset model.
$\mathcal{I}_i$ : $z_i$'s influence on the whole test-set risk.
$\mathcal{I}_\theta(z_i)$ : $z_i$'s influence on the model parameters.
$\epsilon_i$ : perturbation put on $z_i$'s loss term.
$\pi_i$ : sampling probability of $z_i$.
$\ell(z_i, \theta)$ : model $\theta$'s loss on training sample $z_i$.
$\ell(z'_j, \theta)$ : model $\theta$'s loss on test sample $z'_j$.
$P$ : training distribution.
$Q$ : a specific test distribution.
Weighted subsampling.
For a general subsampling framework, each sample $z_i$ is assigned a random variable
$o_i \in \{0, 1\}$, indicating whether this sample is selected or not, such that $\Pr[o_i = 1] = \pi_i$. The weighted subsampling methods share a similar form of objective function on the subset:

$\hat{R}_w(\theta) = \frac{1}{n}\sum_{i=1}^{n} \frac{o_i}{\pi_i}\,\ell(z_i, \theta)$ (2)

where each term is weighted by the inverse of its sampling probability. This is similar to the technique used in causal inference to handle selection bias [19]. Eq. (3) derives the expectation of $\hat{R}_w(\theta)$ over the selection variables $o$:

$\mathbb{E}_{o}\big[\hat{R}_w(\theta)\big] = \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{E}[o_i]}{\pi_i}\,\ell(z_i, \theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(z_i, \theta) = \hat{R}(\theta)$ (3)

The expectation of the weighted objective on a subset is the same as the empirical risk on the full set, which means weighted subsampling aims at finding an optimal $\pi$ that brings the subset risk minimizer as close to the full-set risk minimizer as possible.
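As a sanity check, the unbiasedness in Eq. (3) can be verified numerically. The per-sample losses and sampling probabilities below are arbitrary made-up values for illustration, not from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
losses = rng.random(n) + 0.5            # hypothetical per-sample losses l(z_i, theta)
pi = rng.uniform(0.2, 0.9, size=n)      # hypothetical sampling probabilities pi_i

full_risk = losses.mean()               # full-set empirical risk

# Monte Carlo over the selection variables o_i ~ Bernoulli(pi_i)
trials = 50_000
o = rng.random((trials, n)) < pi
weighted_risk = (o * (losses / pi)).sum(axis=1) / n   # weighted objective, Eq. (2)

# Eq. (3): the expectation of the weighted subset risk equals the full-set risk
assert abs(weighted_risk.mean() - full_risk) < 1e-2
# but the inverse-probability weights inflate the variance across draws
assert weighted_risk.std() > 0.01
```

The second assertion previews the variance problem discussed next: the $1/\pi_i$ weights keep the estimator unbiased, at the cost of spread across draws.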
This weighted approach has three main challenges: 1) we must modify existing training procedures to accommodate the weighted loss function; 2) because $\pi_i$ can be small and its inverse ranges widely, the weighted loss function suffers from high variance; and 3) most importantly, since the weighted methods build a consistent estimator of the full-set model [2], the subset model theoretically cannot outperform the full-set model.

Unweighted subsampling.
We propose a novel unweighted subsampling method which does not require the weights $1/\pi_i$ in its objective function:

$\hat{R}_u(\theta) = \frac{1}{|S|}\sum_{i \in S} \ell(z_i, \theta), \qquad S = \{i : o_i = 1\}$ (4)

where $|S|$ denotes the cardinality of the subset. From Eq. (4), the expectation of the subset model's risk is no longer equal to the full-set model's. This formula can be seen as implicitly reweighting the samples with respect to their sampling probabilities. It directly overcomes the first two challenges of weighted subsampling mentioned above, but further effort is required to solve the last one, which will be introduced in the following section. An intuitive demonstration of the difference between the weighted and unweighted methods is shown in Fig. 1.
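The implicit reweighting can also be illustrated numerically. In this sketch the losses are made up, and the sampling probabilities are deliberately anti-correlated with loss (as an influence-based scheme would produce); the expected unweighted subset risk then tracks the $\pi$-weighted average of the losses rather than the full-set risk:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
losses = rng.random(n) + 0.5                      # hypothetical per-sample losses
pi = 1.0 / (1.0 + np.exp(3.0 * (losses - 1.0)))   # high-loss samples get low probability

full_risk = losses.mean()
# implicit reweighting: each sample enters in proportion to its sampling probability
implicit = (pi * losses).sum() / pi.sum()

trials = 20_000
subset_risks = []
for _ in range(trials):
    sel = rng.random(n) < pi                      # o_i ~ Bernoulli(pi_i)
    if sel.any():
        subset_risks.append(losses[sel].mean())   # unweighted objective, Eq. (4)

assert abs(np.mean(subset_risks) - implicit) < 0.03
```

No training-procedure change and no $1/\pi_i$ terms are needed; the subset loss is just a plain average, which is the practical appeal of the unweighted form.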
Influence functions.
If the $i$-th training sample is upweighted by a small $\epsilon_i$ on its loss term, then the perturbed risk minimizer is

$\hat\theta_{\epsilon_i} = \arg\min_{\theta}\ \frac{1}{n}\sum_{k=1}^{n} \ell(z_k, \theta) + \epsilon_i\,\ell(z_i, \theta)$ (5)

The basic idea of the IF is to approximate the change of the parameters [4]:

$\mathcal{I}_\theta(z_i) = \frac{d\hat\theta_{\epsilon_i}}{d\epsilon_i}\Big|_{\epsilon_i = 0} = -H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z_i, \hat\theta)$ (6)

or the change of the test risk on a given test sample $z'_j$ from $Q$ [12]:

$\mathcal{I}(z_i, z'_j) = \frac{d\,\ell(z'_j, \hat\theta_{\epsilon_i})}{d\epsilon_i}\Big|_{\epsilon_i = 0} = -\nabla_\theta \ell(z'_j, \hat\theta)^{\top} H_{\hat\theta}^{-1}\,\nabla_\theta \ell(z_i, \hat\theta)$ (7)

where $z_i$ and $z'_j$ come from the training and test sets respectively, and $H_{\hat\theta} = \frac{1}{n}\sum_{k=1}^{n} \nabla_\theta^{2} \ell(z_k, \hat\theta)$ is the Hessian matrix based on the full-set risk minimizer $\hat\theta$, which is positive definite (PD) if the empirical risk is twice-differentiable and strictly convex in $\theta$.
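For a concrete picture, the following sketch computes Eq. (7) for an L2-regularized logistic regression fit by Newton's method. All function names and the toy data are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def fit_logreg(X, y, w=None, lam=1e-2, n_iter=50):
    """L2-regularized logistic regression via Newton's method.
    w are optional per-sample weights (used below to check the IF)."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (w * (p - y)) / n + lam * theta
        H = X.T @ (X * (w * p * (1 - p))[:, None]) / n + lam * np.eye(d)
        theta = theta - np.linalg.solve(H, grad)
    return theta

def influence_on_test_loss(X, y, x_te, y_te, theta, lam=1e-2):
    """Eq. (7): I(z_i, z') = -grad l(z')^T H^{-1} grad l(z_i), one value per z_i."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    H = X.T @ (X * (p * (1 - p))[:, None]) / n + lam * np.eye(d)
    g_te = (sigmoid(x_te @ theta) - y_te) * x_te      # test-loss gradient
    s = np.linalg.solve(H, g_te)                      # H^{-1} g_te
    g_tr = (p - y)[:, None] * X                       # per-sample training gradients
    return -g_tr @ s
```

A positive influence means upweighting that sample is predicted to increase the test loss, so such samples are candidates for dropping; the approximation can be checked against an actual refit with sample $i$'s weight perturbed by $n\epsilon$, whose test-loss change is roughly $\epsilon\,\mathcal{I}(z_i, z')$.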
Methodology
The key challenge lies in that $\hat\theta$ may not be the best risk minimizer for the test distribution $Q$, due to the distribution shift between $P$ and $Q$ and unknown noisy samples in the training set. With the advent of the IF, we can measure one sample's influence on the test distribution without prohibitive leave-one-out retraining. The essential philosophy here is: given a test distribution $Q$, some samples in the training set cause increased test risk. If they are downweighted, we accordingly obtain a lower test risk than before, namely $R_{te}(\tilde\theta) \le R_{te}(\hat\theta)$, where $\tilde\theta$ is the new model learned after some harmful samples are downweighted.
Subset can be better
Consider perturbations $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ put on the training samples; the perturbed risk minimizer is denoted $\hat\theta_\epsilon$. Given $m$ samples from another distribution $Q$, the objective is to design the $\epsilon$ that minimizes the test risk $R_{te}(\hat\theta_\epsilon) = \frac{1}{m}\sum_{j=1}^{m} \ell(z'_j, \hat\theta_\epsilon)$.
According to the definition of the IF in Eq. (7), we can approximate the loss change of $z'_j$ if $z_i$ is upweighted by a small $\epsilon_i$:

$\ell(z'_j, \hat\theta_{\epsilon_i}) - \ell(z'_j, \hat\theta) \approx \epsilon_i\,\mathcal{I}(z_i, z'_j)$ (8)

which can be extended to the whole test distribution as follows:

$\mathcal{I}_i = \frac{1}{m}\sum_{j=1}^{m} \mathcal{I}(z_i, z'_j)$ (9)

For convenience, we use $\mathcal{I}_i$ to indicate one training sample $z_i$'s influence over the whole test set. Therefore, under the perturbation $\epsilon$, the test risk change can be approximated as follows (we assume that all elements in $\epsilon$ are small and that each training sample influences the test risk independently, hence we add up all terms linearly for simplicity of implementation):

$R_{te}(\hat\theta_\epsilon) - R_{te}(\hat\theta) \approx \sum_{i=1}^{n} \epsilon_i\,\mathcal{I}_i$ (10)

Specifically, suppose $\epsilon_i = 0$ for $i = 1, \dots, n$; then the subset is the same as the full set, and the approximated test risk change in Eq. (10) is $0$. Based on this analysis, we have Lemma 1. For clear notation, we use bold letters to represent random variables, such that $\epsilon_i$ and $\mathcal{I}_i$ are realizations of the random variables $\boldsymbol{\epsilon}$ and $\boldsymbol{\mathcal{I}}$, respectively.
Lemma 1.
The expectation of the influence function over the training distribution is always 0, which means:

$\mathbb{E}_{\mathbf{z}\sim P}\big[\boldsymbol{\mathcal{I}}\big] = 0$ (11)
According to Eq. (10), minimizing the test risk is equivalent to minimizing the objective function $\sum_{i} \epsilon_i\,\mathcal{I}_i$. Actually, this objective function is the empirical form of $\mathbb{E}[\boldsymbol{\epsilon}\,\boldsymbol{\mathcal{I}}]$, from which we derive Lemma 2.
Lemma 2.
The subset model performs no worse than the full-set model in terms of test risk if $\boldsymbol{\epsilon}$ and $\boldsymbol{\mathcal{I}}$ are negatively correlated:

$\mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) \le 0 \ \Longrightarrow\ R_{te}(\hat\theta_\epsilon) \le R_{te}(\hat\theta)$ (12)
Deterministic vs. Probabilistic sampling
Similar to Eq. (3), the expectation of the unweighted objective on the subset via the observation variables $o_i$ can be acquired (up to the normalization $1/|S|$) by:

$\mathbb{E}_{o}\Big[\sum_{i=1}^{n} o_i\,\ell(z_i, \theta)\Big] = \sum_{i=1}^{n} \pi_i\,\ell(z_i, \theta)$ (13)

However, the objective function in Eq. (10) is defined on the perturbation $\epsilon$ instead of the sampling probability $\pi$, so we need to bridge the gap between them:

$\sum_{i=1}^{n} \epsilon_i\,\mathcal{I}_i = \sum_{i=1}^{n} (1 + \epsilon_i)\,\mathcal{I}_i - \sum_{i=1}^{n} \mathcal{I}_i = \sum_{i=1}^{n} (1 + \epsilon_i)\,\mathcal{I}_i$ (14)

Eq. (14) holds because from Lemma 1 we know $\sum_{i} \mathcal{I}_i \approx 0$. Here we assume $\epsilon_i \in [-1, 0]$ ($\epsilon_i = 0$ means no perturbation is applied, while $\epsilon_i = -1$ means a sample is totally dropped from the objective function; all perturbations are assumed to lie within this interval). Therefore, if we let $\pi_i = 1 + \epsilon_i$, the perturbation is transformed to a sampling probability, because Eq. (13) and Eq. (14) are in the same form.
In fact, the resulting objective has a closed form of optimal $\pi$:

$\pi_i^{*} = \mathbb{1}\,[\mathcal{I}_i \le 0]$ (15)

This form of sampling is termed Data dropout in [25], while here we call it Deterministic sampling, because it simply sets a threshold and selects samples deterministically. By contrast, sampling with a continuous function $\pi_i = \pi(\mathcal{I}_i)$ is called Probabilistic sampling, since each sample has a probability of being selected. Most sampling studies belong to the probabilistic category: [24] builds sampling probabilities based on A-optimality and L-optimality respectively, and [23] uses influence-based probabilities.
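The claim that thresholding minimizes the linearized objective can be checked directly: over $\pi_i \in [0, 1]$, the minimizer of $\sum_i \pi_i\,\mathcal{I}_i$ keeps exactly the samples with non-positive influence. The influence values below are made up for illustration:

```python
import numpy as np

def deterministic_pi(infl):
    """Data-dropout / deterministic sampling: keep a sample iff its influence <= 0."""
    return (infl <= 0).astype(float)

rng = np.random.default_rng(0)
infl = rng.normal(size=1000)        # hypothetical influence values I_i
pi_star = deterministic_pi(infl)
obj_star = pi_star @ infl           # minimized linearized test-risk change

# no other feasible pi in [0, 1]^n achieves a smaller objective
for _ in range(200):
    pi = rng.random(1000)
    assert pi @ infl >= obj_star
```

The hard 0/1 decision is exactly what makes this rule non-Lipschitz near zero influence, which the next subsection argues costs distributional robustness.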
Analysis of sampling functions
For the Dropout method, a sample's influence over a specific $Q$ is the essential criterion for sampling, so the obtained subset model ends up optimal only for that $Q$. However, the subset model's robustness to distribution shift [10] is also a concern. That is, for a set of distributions around the empirical one, can our subset model still maintain its performance? In this work, we postulate that influence-based subsampling must trade the subset model's performance on a specific $Q$ off against its distributional robustness. From this viewpoint, the Dropout method is overly confident in a single $Q$ at the expense of deteriorating generalization ability, so it is reasonable to measure and control this confidence degree for our subsampling methods.
Consider an uncertainty set around the training distribution:

$\mathcal{Q} = \{\,Q : Q \ll P,\ D(Q\,\|\,P) \le \rho\,\}$ (16)

where $Q \ll P$ denotes that $Q$ is absolutely continuous w.r.t. $P$, and $D(\cdot\|\cdot)$ denotes a divergence (e.g. the $\chi^2$-divergence). The set $\mathcal{Q}$ is a divergence ball of radius $\rho$, containing all neighborhoods of the empirical distribution $P$. The worst-case risk is defined as the supremum of the risk over any $Q \in \mathcal{Q}$ [3]:

$R_{wc}(\theta) = \sup_{Q \in \mathcal{Q}} \mathbb{E}_{\mathbf{z}\sim Q}\big[\ell(\mathbf{z}, \theta)\big]$ (17)

From [5], the dual form of Eq. (17) (for the $\chi^2$-divergence ball) is:

$R_{wc}(\theta) = \inf_{\eta \in \mathbb{R}} \Big\{ c_\rho\,\big(\mathbb{E}_{P}\big[(\ell(\mathbf{z}, \theta) - \eta)_+^{2}\big]\big)^{1/2} + \eta \Big\}$ (18)

where $\eta$ is the dual variable and $c_\rho$ is a constant depending only on the radius $\rho$. This duality transforms the supremum in Eq. (17) into a convex function of the empirical distribution $P$, thus allowing us to measure the worst-case risk quantitatively. Before analyzing how the worst-case risk changes with different sampling functions in Theorem 3, we need to introduce two terms:
Definition 1.
A function $f: \mathbb{R}^d \to \mathbb{R}$ is said to be Lipschitz continuous with constant $L$ if $|f(x_1) - f(x_2)| \le L\,\|x_1 - x_2\|$ for all $x_1, x_2$.
Definition 2.
A function $f$ has bounded gradients if $\|\nabla f(x)\| \le B$ for all $x$.
Theorem 3.
Let $\eta^*$ be the optimal dual variable achieving the infimum in Eq. (18), and let the perturbation (sampling) function have bounded gradients with bound $B$. Then the worst-case risk is Lipschitz continuous w.r.t. the IF vector, with a Lipschitz constant proportional to $B$.

Theorem 3 relates the change rate of the worst-case risk to the gradient bound of the perturbation function. For the Dropout method, its sampling function Eq. (15) has an unbounded gradient, since it is discontinuous at the zero point. This property makes the worst-case risk no longer Lipschitz continuous, so it can fluctuate sharply. By contrast, our probabilistic methods can adjust the confidence degree by tuning the gradient bound. This is crucial to avoid overconfidence in a specific $Q$ that leads to large risk on other distributions. In fact, our experiments bear out that our probabilistic methods maintain their performance out-of-sample with a proper gradient bound, while the Dropout method often crashes. The proof of Theorem 3 can be found in Appendix C.
Surrogate metric for confidence degree
Although we find that the gradient bound $B$ is the determinant of the confidence degree, it is still intractable to measure this degree quantitatively, which is important for guiding our methods' use in practice. Empirically, to deal with overfitting, practitioners prefer adding constraints on the model's parameters $\theta$, e.g. a norm regularizer. In our theory, we propose to apply the parameter shift $\|\tilde\theta - \hat\theta\|$ to evaluate the confidence degree over a specific $Q$. We term this a surrogate metric for the confidence degree, and prove in Theorem 4 that it is reasonable because the parameter shift has the same magnitude of Lipschitz constant as the worst-case risk. In detail, the worst-case risk and our surrogate metric share the same change rate with respect to the sampling function's gradient bound $B$.
Theorem 4.
Let the perturbation function have bounded gradients, and let the loss gradient norm $\|\nabla_\theta \ell\|$ be bounded. Then the parameter shift $\|\tilde\theta - \hat\theta\|$ is Lipschitz continuous w.r.t. the IF vector, with a Lipschitz constant of the same order as that of the worst-case risk in Theorem 3.
Implementation
In this section, the unweighted subsampling method is incorporated into our framework, shown in Fig. 2: we train $\hat\theta$ on the full data set, and calculate the IF vector on the training set with respect to the validation loss. The sampling probabilities are then acquired with the designed probabilistic sampling function. We will discuss the two basic modules of this framework: 1) calculating the IF and 2) designing probabilistic sampling functions.
Calculating influence functions
The IF in Eq. (7) can be calculated in two steps: first compute the inverse Hessian-vector product (HVP) $s = H_{\hat\theta}^{-1} v$ with $v = \nabla_\theta \ell(z'_j, \hat\theta)$, then multiply it with $\nabla_\theta \ell(z_i, \hat\theta)$ for each training sample. To handle sparse scenarios where $x$ has high dimension, [16] proposes transforming the inverse HVP into an optimization problem, $s = \arg\min_t \frac{1}{2} t^{\top} H t - v^{\top} t$, and solving it with the Newton conjugate gradient (Newton-CG) method. Moreover, [1] proves that stochastic estimation makes the calculation feasible when the loss function is non-convex in $\theta$. These works ensure our framework's feasibility in both convex and non-convex scenarios. Without loss of generality, we mainly focus on convex scenarios.
When CG converges slowly because of an ill-conditioned subproblem, a mixed preconditioner is useful to reduce the number of CG steps [17, 8]:

$M = \alpha\,\mathrm{diag}(H) + (1 - \alpha)\,I$ (19)

where $\alpha \in [0, 1]$ is a weight parameter, $I$ is the identity matrix, and $H$ is the Hessian matrix. Specifically for the logistic regression model, the diagonal elements of $H$ are:

$H_{kk} = \frac{1}{n}\sum_{i=1}^{n} x_{ik}^{2}\, h_\theta(x_i)\big(1 - h_\theta(x_i)\big) + \lambda$ (20)

where $\lambda$ is the regularization parameter. Our experiments demonstrate that the mixed PCG is efficacious for speeding up the calculation of the IF.
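A minimal Hessian-free sketch of this solver for logistic regression, using SciPy's conjugate gradient with the mixed diagonal preconditioner of Eq. (19); the function name and default values are illustrative, not the paper's code:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inverse_hvp_pcg(X, theta, v, lam=1e-2, alpha=0.5):
    """Solve H s = v without ever forming H, where H is the regularized
    logistic-regression Hessian; preconditioner M = alpha*diag(H) + (1-alpha)*I."""
    n, d = X.shape
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    w = p * (1.0 - p)                                # per-sample curvature
    def hvp(t):                                      # H t, Hessian-free
        return X.T @ (w * (X @ t)) / n + lam * t
    H_op = LinearOperator((d, d), matvec=hvp)
    diag_H = (X ** 2 * w[:, None]).sum(axis=0) / n + lam   # Eq. (20)
    m = alpha * diag_H + (1.0 - alpha)               # Eq. (19), diagonal preconditioner
    M_op = LinearOperator((d, d), matvec=lambda t: t / m)
    s, info = cg(H_op, v, M=M_op, maxiter=1000)
    return s, info
```

Only matrix-vector products with `X` are needed, so the same code works when `X` is a `scipy.sparse` matrix, which is the sparse high-dimensional setting the paper targets.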
Probabilistic sampling functions
From Lemma 2, a better subset model is ensured by a decreasing sampling function $\pi(\mathcal{I})$. Furthermore, Theorems 3 and 4 prove that the gradient bound of $\pi(\cdot)$ can be adjusted to control the confidence degree over a specific $Q$. We can therefore design a family of probabilistic sampling functions with a tunable hyperparameter. Here we develop two basic functions, termed Linear sampling and Sigmoid sampling.
Linear sampling.
Inspired by [23], which builds sampling probabilities proportional to the influence magnitude, we design a Linear sampling function where $\pi_i$ decreases linearly in $\mathcal{I}_i$:

$\pi_i = \max\big(0, \min(1,\ c - a\,\mathcal{I}_i)\big)$ (21)

where $a > 0$ is a tunable coefficient and $c$ is an offset chosen to meet the target sampling ratio. It is easy to prove that the gradient bound of this function is $a$, so the degree of confidence relies on $a$ for Linear sampling. It is a little different from [23], because $\mathcal{I}_i$ can be both negative and positive, which means many samples can have zero probability of being sampled. If we set a relatively high sampling ratio, we may never get enough samples in our subset. Empirically, we find that randomly picking up the remaining zero-probability samples reaches relatively good results.
Sigmoid sampling.
The Sigmoid function is generally used in logistic regression to scale outputs into
$(0, 1)$, indicating the probability of each class, so here we can use it to transform $\mathcal{I}_i$ into a probability as follows:

$\pi_i = \frac{1}{1 + \exp(a\,\mathcal{I}_i)}$ (22)

where $a > 0$ is a tunable coefficient. For the Sigmoid function, we can still adjust
$a$ to make the probability distribution flatter or steeper, thereby controlling the confidence degree.
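Both families can be sketched in a few lines; the hyperparameter names `a` and `c` are ours, and the clipping in the linear variant is one reasonable way to keep probabilities valid:

```python
import numpy as np

def linear_pi(infl, a=1.0, c=0.5):
    """Linear sampling (sketch of Eq. 21): decreasing in influence,
    gradient bound a, clipped to valid probabilities."""
    return np.clip(c - a * infl, 0.0, 1.0)

def sigmoid_pi(infl, a=1.0):
    """Sigmoid sampling (sketch of Eq. 22): smooth and strictly decreasing;
    larger a makes the curve steeper, i.e. a more confident sampler."""
    return 1.0 / (1.0 + np.exp(a * infl))

def draw_subset(pi, rng):
    """Bernoulli draws o_i ~ Bern(pi_i); returns indices of the chosen subset."""
    return np.flatnonzero(rng.random(len(pi)) < pi)
```

Tuning `a` trades confidence for robustness: as `a` grows, `sigmoid_pi` approaches the deterministic 0/1 rule of Eq. (15); as `a` shrinks toward 0, it approaches uniform random sampling with probability 1/2.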
Experiments
In this section, we first present the data sets and experiment settings, then introduce several baselines for comparison. After that, we evaluate our methods in terms of effectiveness, robustness, and efficiency.
Data sets
We perform extensive experiments on various public data sets which cover many domains, including computer vision, natural language processing, click-through rate prediction, etc. Additionally, we test the methods on the
Company data set, which contains around 100 million samples with over 10 million features. They are queries collected from a real-world recommender system, whose feature set contains user history behaviour, the item's side information, and contextual information such as time and location. These data sets range from small to large and from low to high dimensionality, which can testify to the methods' effectiveness and robustness in diverse scenarios. The data set statistics and more details about preprocessing are described in appendix E.

Data set | Full set | Random | OptLR | Dropout | Lin-UIDS* | Sig-UIDS*

UCI breast-cancer | 0.0914 | 0.0944 | 0.0934 | 0.0785 | 0.0873 | 0.0803
diabetes | 0.5170 | 0.5180 | 0.5232 | 0.5083 | 0.5127 | 0.5068
News20 | 0.5130 | 0.5177 | 0.5203 | 0.5072 | 0.5100 | 0.5075
UCI Adult | 0.3383 | 0.3386 | 0.3549 | 0.3538 | 0.3384 | 0.3382
cifar10 | 0.6847 | 0.6861 | 0.7246 | 0.6851 | 0.6822 | 0.6819
MNIST | 0.0245 | 0.0247 | 0.0239 | 0.0223 | 0.0245 | 0.0231
real-sim | 0.2606 | 0.2668 | 0.2644 | 0.2605 | 0.2607 | 0.2609
SVHN | 0.6129 | 0.6128 | 0.6757 | 0.6328 | 0.6122 | 0.6128
skin-nonskin | 0.3527 | 0.3526 | 0.3529 | 0.4830 | 0.3713 | 0.3527
Criteo 1% | 0.4763 | 0.4768 | 0.4953 | 0.4786 | 0.4755 | 0.4756
Covertype | 0.6936 | 0.6933 | 0.6907 | 0.7745 | 0.6872 | 0.6876
Avazu-app | 0.3449 | 0.3449 | 0.3450 | 0.3576 | 0.3446 | 0.3446
Avazu-site | 0.4499 | 0.4499 | 0.4505 | 0.5736 | 0.4490 | 0.4486
Company | 0.1955 | 0.1956 | 0.1958 | 0.1964 | 0.1952 | 0.1953

* UIDS is the abbreviation of our Unweighted Influence Data Subsampling method. Lin and Sig indicate the incorporated Linear and Sigmoid sampling functions, respectively.
Considered baselines
We select the model trained on the full data set and three other sampling approaches as baselines for comparison: Optimal sampling [23], Data dropout [25], and Random sampling.

Optimal sampling. A recent weighted subsampling method whose sampling probabilities are proportional to the magnitude of the sample's influence. This method aims at best approaching the full-set performance, so theoretically it cannot overtake the full set.

Data dropout. An unweighted subsampling approach which adopts a simple sampling strategy: dropping out unfavorable samples whose influence on the validation loss is positive.

Random sampling. Simple random selection on Tr, meaning all samples share the same probability. Theoretically, this strategy cannot win over the full set either.
Experiment protocols
In our experiments, we use a Tr-Va-Te setting, which is different from the Tr-Va setting of much previous work (see Fig. 4). Both settings proceed in three steps and share the same first two: 1) train the model $\hat\theta$ on the full Tr set, predict on Va, then compute the IF; 2) derive sampling probabilities from the IF, sample Tr to get the subset, then train the subset model $\tilde\theta$. In the third step, we introduce an additional out-of-sample test set (Te) to test $\tilde\theta$ (step (b.3)) rather than testing $\tilde\theta$ on Va (step (a.3)). The reason is that if we use $\hat\theta$'s validation loss on Va to guide the subsampling and then train the subset model $\tilde\theta$, the testing result of $\tilde\theta$ on Va cannot convince us of its generalization ability.
In fact, our framework is applicable to both convex and non-convex models, and we mainly focus on subsampling theory in this work. For implementation simplicity, we use logistic regression in all experiments. Besides, to ensure that our methods indeed achieve good performance in terms of the metric they optimize for, we use the logistic loss for computing the influence function, and correspondingly the log-loss as the evaluation metric in all experiments. More details about the experimental settings can be found in appendix F.
Experiment observations
Result 1: Effectiveness.
The experimental results are shown in Table 2, where the sampling ratio is set to the same value for all sampling methods. The average test log-loss values (over 10 repeated samplings from the same Tr set) are listed in each column. Bold entries indicate a log-loss less than the full-set model's, and underlined ones are the best across the row. It can be seen that our Sig- and Lin-UIDS outperform the full-set model on most data sets, while Dropout often fails. Besides, due to the high variance incurred by the weight terms, the OptLR method suffers severely. In a nutshell, Sig-UIDS performs best on 5 of 14 data sets, and Lin-UIDS and Dropout each achieve 4. This means overconfidence is sometimes beneficial on homogeneous data sets, e.g. MNIST, but Dropout fails on all relatively large-scale and heterogeneous data sets. The probabilistic sampling methods have universal superiority over the others, since they stay robust on a set of distributions rather than on a specific one (the Va set).
Our unweighted method can downweight the bad cases which cause high test loss for our model, which is an important reason for its ability to improve results with less data. To show the performance of our methods with noisy labels, we perform additional experiments with some training labels flipped. The results in Fig. 5 show the enlarged superiority of our subsampling methods.
Result 2: Robustness.
In Fig. 6, we can see that the Dropout method performs very well on the Va set; however, it fails in the out-of-sample test. To illustrate how the proposed surrogate metric for the confidence degree works, we vary the sampling ratio from large to small and observe how the surrogate metric changes. As shown in Fig. 7, Dropout causes a large parameter shift, while our Sig-UIDS has as small a shift as Random sampling. This phenomenon coincides with our Theorem 4 on the Lipschitz constant. With a proper coefficient in Sig-UIDS, the majority of sampling probabilities stay moderate, which makes the sampling process smoother.
Result 3: Efficiency.
Table 3 shows a summary of the running times. For most data sets, our method can calculate the IF within one minute. On large and sparse data sets, it completes the computation within ten minutes, which is acceptable in practice.
Conclusion & Future Work
In this work, we theoretically study unweighted subsampling with the IF, then propose a novel unweighted subsampling framework and design a family of probabilistic sampling methods. The experiments show that 1) unlike previous weighted methods, our unweighted method can acquire a subset model that indeed wins over the full-set model on a given test set; and 2) it is crucial to evaluate the confidence degree over the empirical distribution to enhance the subset model's generalization ability.
Although our framework of Unweighted Influence Data Subsampling (UIDS) succeeds in improving model accuracy, some interesting ideas remain to be explored. Since our framework is applicable to both convex and non-convex models, we can further verify its performance on non-convex models, e.g. deep neural networks. Another direction is to develop better approaches to deal with the overfitting issue, e.g. building a validation set selection scheme. Besides, we plan to deploy our method in industry in the future.
Acknowledgement
The research of ShaoLun Huang was funded by the Natural Science Foundation of China 61807021, Shenzhen Science and Technology Research and Development Funds (JCYJ20170818094022586), and Innovation and entrepreneurship project for overseas highlevel talents of Shenzhen (KQJSCX20180327144037831). The authors would like to thank Professor ChihJen Lin’s insight and advice for theory and writing of this work.
References
[1] (2017) Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research 18(1), pp. 4148-4187.
[2] (2018) Optimal subsampling algorithms for big data generalized linear models. arXiv preprint arXiv:1806.06761.
[3] (2005) Robust supervised learning. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI 2005), Pittsburgh, Pennsylvania, USA, pp. 714-719.
[4] (1980) Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22(4), pp. 495-508.
[5] (2018) Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750.
[6] (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Annals of Statistics 42(5), p. 1693.
[7] (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), pp. 119-139.
[8] (2018) Preconditioned conjugate gradient methods in truncated Newton frameworks for large-scale linear classification. In Asian Conference on Machine Learning, pp. 312-326.
[9] (2016) Does distributionally robust supervised learning give robust classifiers? In ICML.
[10] (2018) Does distributionally robust supervised learning give robust classifiers? In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, pp. 2034-2042.
[11] (2011) Robust Statistics. Springer.
[12] (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1885-1894.
[13] (2010) Self-paced learning for latent variable models. In 24th Annual Conference on Neural Information Processing Systems (NIPS 2010), pp. 1189-1197.
[14] (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, pp. 2999-3007.
[15] (2011) Ensemble of exemplar-SVMs for object detection and beyond. In IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, pp. 89-96.
[16] (2010) Deep learning via Hessian-free optimization. In ICML, Vol. 27, pp. 735-742.
[17] (1985) Preconditioning of truncated-Newton methods. SIAM Journal on Scientific and Statistical Computing 6(3), pp. 599-616.
[18] (2018) Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050.
[19] (2016) Recommendations as treatments: debiasing learning and evaluation. arXiv preprint arXiv:1602.05352.
[20] (2018) Finding influential training samples for gradient boosted decision trees. arXiv preprint arXiv:1802.06640.
[21] (2019) Axiomatic characterization of data-driven influence measures for classification. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 718-725.
[22] (2014) Intriguing properties of neural networks. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada.
[23] (2018) Optimal subsampling with influence functions. In Advances in Neural Information Processing Systems, pp. 3650-3659.
[24] (2018) Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association 113(522), pp. 829-844.
[25] (2018) Data dropout: optimizing training data for convolutional neural networks. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 39-46.
[26] (2019) Theoretically principled trade-off between robustness and accuracy. In ICML.
[27] (2018) Training set debugging using trusted items. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 4482-4489.
Appendix A. Proof of Lemma 1
Lemma 1.
The expectation of the influence function over the training distribution is always 0, which means:

$\mathbb{E}_{\mathbf{z}\sim P}\big[\boldsymbol{\mathcal{I}}\big] = 0$ (1)

Proof.
Taking the expectation over the empirical training distribution, $\mathbb{E}[\boldsymbol{\mathcal{I}}]$ is simply the average over the full training set. Based on the IF's definition, we have

$\mathbb{E}\big[\boldsymbol{\mathcal{I}}\big] = \frac{1}{n}\sum_{i=1}^{n} \mathcal{I}_i = -\frac{1}{m}\sum_{j=1}^{m} \nabla_\theta \ell(z'_j, \hat\theta)^{\top} H_{\hat\theta}^{-1}\Big(\frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \ell(z_i, \hat\theta)\Big) = 0$ (2)

because $\frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \ell(z_i, \hat\theta) = 0$ at the risk minimizer $\hat\theta$ in this scenario. ∎
Appendix B. Proof of Lemma 2
Lemma 2.
The subset model performs no worse than the full-set model in terms of test risk if $\boldsymbol{\epsilon}$ and $\boldsymbol{\mathcal{I}}$ are negatively correlated:

$\mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) \le 0 \ \Longrightarrow\ R_{te}(\hat\theta_\epsilon) \le R_{te}(\hat\theta)$ (3)

Proof.
Decomposing the expectation, we get $\mathbb{E}[\boldsymbol{\epsilon}\,\boldsymbol{\mathcal{I}}] = \mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) + \mathbb{E}[\boldsymbol{\epsilon}]\,\mathbb{E}[\boldsymbol{\mathcal{I}}]$. Based on Lemma 1, $\mathbb{E}[\boldsymbol{\mathcal{I}}] = 0$, such that $\mathbb{E}[\boldsymbol{\epsilon}\,\boldsymbol{\mathcal{I}}] = \mathrm{Cov}(\boldsymbol{\epsilon}, \boldsymbol{\mathcal{I}}) \le 0$, which by the linearization in Eq. (10) means the subset model's test risk is less than or equal to the full-set model's. ∎
Appendix C. Proof of Theorem 3
Theorem 3.
Let $\eta^*$ be the optimal dual variable that achieves the infimum in Eq. (4), and let the perturbation function have bounded gradients. Then the worst-case risk is a Lipschitz continuous function w.r.t. the IF vector, with a Lipschitz constant proportional to the gradient bound.
Proof.
In order to measure a model's performance on an uncertainty set $\mathcal{Q}$, it is common to define the worst-case risk as $R_{wc}(\theta) = \sup_{Q \in \mathcal{Q}} \mathbb{E}_{\mathbf{z}\sim Q}[\ell(\mathbf{z}, \theta)]$. Its dual form is given as:

$R_{wc}(\theta) = \inf_{\eta \in \mathbb{R}} \Big\{ c_\rho\,\big(\mathbb{E}_{P}\big[(\ell(\mathbf{z}, \theta) - \eta)_+^{2}\big]\big)^{1/2} + \eta \Big\}$ (4)

whose gradient w.r.t. the IF vector is:
(5) 
where $\eta^*$ helps Eq. (4) reach the infimum. Without loss of generality, take one element of the gradient vector and analyze its bound:
(6)  
(7)  
(8)  
(9)  
(10) 
Hence we can bound the norm of the gradient as:
(11)  
(12) 
That means the change rate of the worst-case risk is aligned with the gradient bound of the perturbation function. ∎
Appendix D. Proof of Theorem 4
Theorem 4.
Let the perturbation function have bounded gradients, and let the loss gradient norm be bounded. Then the parameter shift $\|\tilde\theta - \hat\theta\|$ is Lipschitz continuous w.r.t. the IF vector, with a Lipschitz constant of the same order as that in Theorem 3.
Proof.
Note that the parameter shift is a function of the IF vector; its gradient w.r.t. this vector is also a vector of the same dimension:
(13) 
In fact, proving that the parameter shift is Lipschitz continuous is equivalent to proving that its gradient is bounded. Let us select one arbitrary element of the vector and try to derive its bound:
(14)  
(15)  
(16)  
(17) 
The first approximation, Eq. (15), comes from the definition of the influence function on parameters. The first inequality, Eq. (16), holds since the perturbation function has bounded gradients. The second inequality, Eq. (17), comes from the Cauchy-Schwarz inequality.
Note that the loss gradient is bounded, so the corresponding term must be bounded as well. Here we can make the approximation that higher-order terms are negligible if each $\epsilon_i$ is small, such that
(18)  
(19)  
(20) 
The second inequality, Eq. (19), holds because the gradient norm is bounded. Combining Eq. (17) and Eq. (20), we can derive that each element is bounded, such that the gradient norm is bounded:
(21)  
(22)  
(23) 
Therefore, we can conclude from Eq. (23) that the parameter shift is Lipschitz continuous, and it is easy to derive the corresponding Lipschitz constant. ∎
Appendix E. Data Sets and Experimental Settings
Data set
The data sets statistics can be found in Table 1, and several of them are processed specifically.
MNIST, cifar10 and SVHN.
They are all 10-class image classification data sets, while logistic regression can only handle binary classification. On MNIST and SVHN, we select the digits 1 and 7 as the positive and negative classes; on cifar10, we classify cat vs. dog. Each image is converted to a flattened feature vector with all pixel values scaled to a common range.
Covertype.
It is a multi-class forest cover type classification data set, which is transformed to binary classification, and all features are scaled.
News20.
This is a size-balanced two-class variant of the UCI 20 Newsgroups data set, where each class contains 10 of the original newsgroups, and each example vector is normalized to unit length.
Criteo1%.
It was used in a CTR prediction competition held jointly by Kaggle and Criteo in 2014. The data used here has undergone feature engineering according to the winning solution of that competition. We randomly sample 1% of the examples from the original data set (hence the name Criteo 1%).
Avazuapp and Avazusite.
This data was used in a CTR prediction competition held jointly by Kaggle and Avazu in 2014. Here the data is generated according to the winning solution, where it is split into two groups, "app" and "site", for better performance.
Experimental settings
For logistic regression on both the full set and the subset, we use the same regularization term for fair comparison. For the Optimal sampling method, we scale the probabilities into a valid range and clip them to prevent the weights from having large variance, following the original work. For the Data dropout method, we rank the samples by their IF and select the top ones. For the Linear sampling function, we set the coefficients similarly to Optimal sampling, and we randomly pick up additional samples if the samples with positive probability are not enough to reach the target sampling ratio. For Sigmoid sampling, we tune the steepness coefficient.
For the public data, we randomly pick data from Tr as the Va set for each data set. For the Company data, with domain knowledge we use 7 days of data as Tr, 1 day for Va, and 1 day for Te. For all subsampling methods, Tr, Va, and Te are kept the same for fair comparison. Besides, to make the test log-loss comparable among different subsampling methods, the positive-negative sample ratio is kept invariant after subsampling for all methods, which avoids the test log-loss being influenced by a shift of the label ratio.