Train on Validation: Squeezing the Data Lemon

02/16/2018
by   Guy Tennenholtz, et al.

Model selection on validation data is an essential step in machine learning. While mixing data between training and validation is considered taboo, practitioners often violate it to increase performance. Here, we offer a simple, practical method for using the validation set for training, which allows a continuous, controlled trade-off between performance and overfitting of the model selection process. We define on-average-validation-stable algorithms as those for which training on small portions of the validation data does not overfit the model selection process. We then prove that stable algorithms are also validation stable. Finally, we demonstrate our method on the MNIST and CIFAR-10 datasets using stable algorithms as well as state-of-the-art neural networks. Our results show a significant increase in test performance with a minor trade-off in the bias admitted to the model selection process.


1 Introduction

Machine learning and statistical analysis play a key role in the development of artificial intelligence. Model selection is the task of selecting a statistical model from a set of candidate models given data, and is usually done by dividing the data into two sets: training and validation. The validation set is the set of examples held out for model selection: it provides an unbiased evaluation of each model fit on the training set and is used to compare the performance of the different models and decide which one to use (e.g., choosing the number of hidden layers in a neural network).

We take particular interest in the regime of “medium” sized datasets, in which applying learning algorithms is practical but additional data can still boost performance. In such datasets the partitioning of the data can be harmful to overall performance. In order to overcome this problem, we propose using a “transparent” validation set from which some information can “leak out” during training (i.e., by training on a portion of the validation data). Training on some of the validation examples in a controlled manner enables an increase in performance with a reasonable, constrained trade-off in the bias of model selection. While such use of validation data is considered taboo, we show that this restriction can be relaxed under certain conditions. We claim that training on validation data should not trivially be considered bad practice: one should carefully use the information in this set for fitting a model, attempting to “squeeze” the most out of the available data whenever possible.

Ideas for selecting models while reusing data have been studied in various fields of research Refaeilzadeh et al. (2009); Cawley & Talbot (2010); Dwork et al. (2015, 2017); Kohavi et al. (1995). A well-known, admissible practice known as k-Fold Cross-Validation Stone (1974); Geisser (1975); Schaffer (1993); Refaeilzadeh et al. (2009) attempts to reduce the amount of bias caused by a particular choice of validation set. Another method, known as Bootstrapping Efron (1982), samples the dataset (with replacement) and validates on the unsampled examples. These methods, however, do not solve the problem of the dichotomic separation of data (that is, both methods could potentially use more of the dataset during the training process). There exist common protocols that attempt to overcome this problem. The most prevalent method used in practice is to train on the whole dataset (train and validation) after a model has been selected. It has been demonstrated that such an incorrect use of validation data for model selection can lead to a misleading optimistic bias in performance evaluation Cawley & Talbot (2010), resulting in an unreliable choice of model.

The use of validation data for training has the potential for great bias in estimated results, but can be partially done when regularized appropriately Cawley & Talbot (2010); Dwork et al. (2017). In this paper, we show there exists a complete spectrum between not using any of the validation data and using all of the validation data for training. When data is indispensable, this calibration may increase the performance of learning algorithms with a minor trade-off in selection bias. In our procedure, we sample each example from the validation set with a fixed probability and add it to the training set. Once training is complete, the validation set is used for model selection (including examples that were used for training). In Theorem 1, we prove that under this procedure the correct model can still be chosen if the learning algorithm is stable and a condition relating its stability rate to the chosen validation set holds. Our results show that when the sampling probability is small enough, a larger, augmented training set can be used, while simultaneously selecting a model using the partially seen validation set. In practice we observed that for iterative algorithms, such as SGD, it is possible to deploy our sampling procedure at each iteration, allowing effective use of the whole dataset for training. Our procedure acts as a form of regularization on the validation examples’ effect on the training process.
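As an illustration, here is a minimal sketch of the presample idea in Python, assuming scikit-learn-style estimators and using k-NN with different values of k as the candidate models; the function names and structure are our own, not the authors' implementation:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def presample(X_train, y_train, X_val, y_val, p, rng):
    # Add each validation example to the training set independently with probability p.
    mask = rng.random(len(X_val)) < p
    X_aug = np.concatenate([X_train, X_val[mask]])
    y_aug = np.concatenate([y_train, y_val[mask]])
    return X_aug, y_aug

def select_k(X_train, y_train, X_val, y_val, candidate_ks, p=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_aug, y_aug = presample(X_train, y_train, X_val, y_val, p, rng)
    scores = {}
    for k in candidate_ks:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_aug, y_aug)
        # Model selection uses the whole validation set, including the sampled examples.
        scores[k] = model.score(X_val, y_val)
    return max(scores, key=scores.get), scores

The key point the sketch illustrates is that the augmented training set and the selection criterion overlap only in the sampled fraction of the validation data, which a small sampling probability keeps limited.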

Previous work has shown that using validation data for model selection in an adaptive, multiple-step process may have undesirable ramifications. “Freedman’s Paradox” Freedman & Freedman (1983) depicts the potentially dangerous scenario where the results of one statistical procedure performed on the data are fed into another procedure performed on the same data. Freedman observed, through simulation, that when the number of variables is of the same order as the number of data points, insignificant variables can seemingly become significant. It has been shown that reusing the validation set adaptively can greatly increase the risk of spurious inference Dwork et al. (2015, 2017). Even stable algorithms, when run sequentially, may result in a procedure that is unstable. Contrary to our work, these studies consider an adaptive process of reusing the validation set, whereas our method selects a model only once after training is complete (i.e., no further use of the data is performed after training). Such adaptive validation procedures, while complementary to our proposed method, must be taken into consideration and performed with extra care. An exhaustive review of these methods can be found in the related work section, and a summary can be found in Table 1.

Our proposed sampling method is the first principled study of how to use validation data for training under theoretical guarantees. In addition, we show experimental evidence that our sampling method can be applied to a broad class of algorithms, including well-known stable algorithms such as linear regression, k-Nearest Neighbors Altman (1992), and Support Vector Machines Hearst et al. (1998); Shalev-Shwartz & Ben-David (2014). We also conduct experiments using our method on state-of-the-art neural networks Huang et al. (2016), showing an (absolute) increase in performance on the CIFAR-10 Krizhevsky & Hinton (2009) dataset for image classification.


[Table 1 here; it compares k-Fold CV Refaeilzadeh et al. (2009), Bootstrap Efron (1982), Thresholdout Dwork et al. (2017), Train after Test, and our method along four attributes: adaptive evaluation, complementary to our method, enlarged training set, and potentially biased. Our method's potential bias depends on the sampling probability.]

Table 1: Comparison of different protocols to our method. Adaptive methods have a multiple-step procedure that uses previously computed validation accuracies in subsequent stages. Methods marked as complementary can be used in conjunction with our method. Methods marked as enlarged effectively expand the initial training set, except for bootstrapping, which trains on copies of identical examples. Finally, methods marked as potentially biased are likely to produce incorrect, biased performance estimates. See the related work section for an exhaustive review of these methods.

2 Preliminaries

We consider the classical supervised learning setting, where we wish to predict a target variable based on observations of an input variable. Let the dataset be a collection of random input-output pairs drawn independently from a fixed unknown distribution. A learning algorithm is a mapping from datasets to a hypothesis class. We measure the quality of predictions using a loss function, which measures the loss of predicting a given example using a given hypothesis. The risk associated with a hypothesis is then defined as the expectation of the loss function.

In general, the risk cannot be computed because the underlying distribution is unknown to the learning algorithm. However, we can compute an approximation, called the empirical risk, by averaging the loss function over the dataset. The Empirical Risk Minimization (ERM) principle states that a learning algorithm should choose a hypothesis which minimizes the empirical risk.
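To make these definitions concrete, one standard way to write them is shown below; the symbols (S, D, H, the loss, and so on) are our notation for illustration and may differ from the paper's:

\[
L_D(h) = \mathbb{E}_{z \sim D}\big[\ell(h, z)\big] \qquad \text{(risk)}
\]
\[
L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i), \qquad S = \{z_1, \dots, z_n\},\; z_i = (x_i, y_i) \overset{\text{i.i.d.}}{\sim} D \qquad \text{(empirical risk)}
\]
\[
A(S) \in \operatorname*{argmin}_{h \in H} L_S(h) \qquad \text{(ERM)}
\]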

2.1 Model Selection

It is often essential to compare multiple learning algorithms to determine which one works best on the problem at hand. It is thus a common practice to define part of the dataset as a validation (holdout) set. The goal of using a validation set is to provide an unbiased evaluation of a model fit on the training dataset while selecting the best model.

We assume the dataset is divided into a training set and a validation set, and define the empirical (training) loss and the validation loss as the average loss over the training and validation examples, respectively. The former is used as the minimization criterion and the latter as the criterion for model selection.

In the finite case, let there be a finite set of candidate models, and for each model consider the validation loss of the learning algorithm when trained on the training set under that model. We denote the optimal model choice as the minimizer of the validation loss:

(1)
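In the notation introduced above (ours, not necessarily the paper's), with a finite set of M candidate models and A_m denoting the learning algorithm under model m, Equation (1) can be read as selecting

\[
m^{*} = \operatorname*{argmin}_{m \in \{1, \dots, M\}} L_V\big(A_m(S)\big),
\]

where L_V is the average loss over the validation set V and S is the training set.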

2.2 Stability

A stable algorithm is one in which a small change to the input results in a small change to the loss. One form of stability measures the effect of replacing an example on the output. As we show in Theorem 1 and the Experiments Section, this property has a major effect on our ability to train on validation examples.

Consider a dataset of examples sampled independently from an unknown distribution. For a stable algorithm, replacing a single training example with a freshly drawn one should change the loss on the replaced example only slightly. The On-Average-Replace-One-Stability (OAROS) criterion, proposed in Shalev-Shwartz & Ben-David (2014) and recalled in Definition 1 below, measures this effect on the loss in expectation by uniformly replacing one example from the training set.

Definition 1 (On-Average-Replace-One-Stability).

Let the stability rate be a monotonically decreasing function of the training-set size. We say that a learning algorithm is On-Average-Replace-One-Stable with that rate w.r.t. a loss function if, for every distribution and every training sample drawn from it, the expected change in loss caused by replacing a uniformly chosen training example with a freshly drawn one is bounded by the stability rate; the expectation is taken over the sample, the fresh example, and the uniform choice of the replaced element of the training sample.
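For reference, the standard form of this definition, following Shalev-Shwartz & Ben-David (2014) and written in our notation, is:

\[
\mathbb{E}_{S \sim D^n,\; z' \sim D,\; i \sim U(\{1,\dots,n\})}\Big[\ell\big(A(S^{(i)}), z_i\big) - \ell\big(A(S), z_i\big)\Big] \le \varepsilon(n),
\]

where S^{(i)} denotes S with its i-th example replaced by the fresh example z', and U({1,...,n}) is the uniform distribution over indices.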

Other forms of stability which are closely related to OAROS include uniform stability Bousquet & Elisseeff (2002) and Leave-One-Out stability Shalev-Shwartz et al. (2009); Elisseeff et al. (2003). Intuitively, stability measures the sensitivity to perturbations in the training set. In particular, it is known that stability of the ERM is sufficient for learnability, as stated in the following theorem. It has even been argued that stability is a necessary condition for learnability Mukherjee et al. (2006).

Theorem.

Let an ERM algorithm be OAROS stable with a given rate; then the expected difference between its risk and its empirical risk is at most that rate.

Note that stability is not a binary property of an algorithm. In fact, the stability rate offers a measure of how stable an algorithm is. As we show in the experiments section, stable algorithms are often even more stable in practice than their known rates suggest (i.e., the known bounds on stability rates are loose in practice).

3 Main Result

In this section, we show that the requirement of a segregated validation set can be relaxed under stability assumptions on the learning algorithm. Our main goal is to use validation data for training while limiting the amount of bias introduced into the model selection stage. One way of doing this is by defining a form of stability over validation examples when selecting the model.

Consider a procedure that receives as input the training set and the validation set and outputs an augmented training set. We wish to choose a procedure such that Equation (1) still holds when training on the augmented set. Specifically,

(2)
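One plausible way to write this requirement, with P denoting the augmentation procedure and reusing our earlier notation (again ours, not necessarily the paper's), is

\[
\operatorname*{argmin}_{m} L_V\big(A_m(P(S, V))\big) = \operatorname*{argmin}_{m} L_V\big(A_m(S)\big) = m^{*}.
\]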

To ensure this holds with high probability, we choose a procedure that preserves the ordering of the candidate models' validation losses: if one model achieves a lower validation loss than another when trained on the original training set, it should also do so when trained on the augmented set. Alternatively, we can choose to randomly sample the smallest possible unit of data (i.e., one example) from the validation set and add it to the training set. Our goal then is to ensure that, for any random choice of example passed to the training procedure, the loss over the validation examples does not change by much when selecting a model over the full validation set (including the sampled example). This thereby ensures the correct choice of a model. We define a validation-stable algorithm to be one in which such an addition results in a small change to the output. The On-Average-Validation-Stability (OAVS) criterion measures, in expectation, the effect on the loss of uniformly choosing one example from the validation set and adding it to the training set.

Definition 2 (On-Average-Validation-Stability).

Let the stability rate be a function that is monotonically decreasing in each of its two arguments separately. We say that a learning algorithm is On-Average-Validation-Stable with that rate if, for every distribution, every training sample, and every validation sample drawn from it, the expected change in the validation loss caused by adding a uniformly chosen validation example to the training set is bounded by the rate evaluated at the two sample sizes; the expectation is also taken over the uniform choice of the added validation example.
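A sketch of this definition in our notation (the exact form in the paper may differ) is:

\[
\mathbb{E}_{S \sim D^n,\; V \sim D^k,\; i \sim U(\{1,\dots,k\})}\Big[L_V\big(A(S \cup \{v_i\})\big) - L_V\big(A(S)\big)\Big] \le \varepsilon(n, k),
\]

where v_i is the uniformly chosen validation example added to the training set and L_V is the average loss over the full validation set V.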

Indeed, if an algorithm is OAVS stable with a given rate, and that rate satisfies

(3)

then sampling one unit of data from the validation set will in expectation satisfy Equation (2). Alternatively, since the OAVS rate is monotonically decreasing in both the training-set and validation-set sizes, for sufficiently large training and validation sets condition (3) holds, and thereby condition (2) holds as well with high probability (by use of a Chernoff bound; see the Appendix).

Validation-stability is an important property of a learning algorithm and bears a close relation to the stability of an algorithm. In the next theorem we show that OAROS algorithms are also OAVS (under standard convergence assumptions). A proof can be found in the Appendix.

Theorem 1.

Let the training and validation sets be as above, and let A be an ERM algorithm which is OAROS stable and whose expected empirical performance converges at a rate that is monotonically decreasing in the training-set size. Then A is On-Average-Validation-Stable, with a rate determined by its OAROS rate and its convergence rate.

Theorem 1 gives a fundamental characterization of when validation examples can be used for training and to what extent. In the following section we describe several possible procedures that use validation examples and show evidence of their ability to properly select the best model, with an increase in overall performance.

4 Experiments

In this section, we describe two sampling procedures for training with a validation set. We then show results of using such procedures on stable algorithms as well as on state-of-the-art neural networks for image classification. Both procedures include a control parameter, the sampling probability, which restricts the amount of validation data that is used for training.

The presample procedure samples examples before training starts. Every example from the validation set is sampled independently with the sampling probability and, if sampled, is added to the training set.

The batch-sample procedure samples full batches. At every training iteration, a batch is taken either from the training set or from the validation set. Specifically, with the sampling probability a batch is drawn from the validation set, and otherwise a batch is drawn from the training set, independently of the other batches.

The latter method samples at a per-iteration level, allowing a uniform use of the validation set, where the sampling probability acts as a regularizer on the validation examples (a lower probability means stronger regularization). A given value of the sampling probability under the batch-sample procedure corresponds to a generally different value under the presample procedure; in particular, the batch-sample procedure can effectively oversample validation examples (thereby “seeing” more validation examples than training examples).
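The sketch below illustrates the per-iteration batch-sample idea inside a generic SGD-style loop; it is our own illustration (the train_step callable and data layout are placeholders), not the authors' TensorFlow code:

import numpy as np

def batch_sample_training(train_data, val_data, train_step, num_iters, batch_size, p, seed=0):
    # train_step is any callable that performs one optimization step on an (X, y) batch.
    rng = np.random.default_rng(seed)
    X_tr, y_tr = train_data
    X_va, y_va = val_data
    for _ in range(num_iters):
        if rng.random() < p:
            # With probability p, draw the batch from the validation set.
            idx = rng.integers(0, len(X_va), size=batch_size)
            train_step(X_va[idx], y_va[idx])
        else:
            # Otherwise, draw a regular training batch.
            idx = rng.integers(0, len(X_tr), size=batch_size)
            train_step(X_tr[idx], y_tr[idx])

After training completes, model selection is performed once on the full validation set, exactly as in the presample sketch above.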

Figure 1: Validation and test accuracies for the presample procedure using k-NN. Top graphs show results for unbiased validation and test sets, while bottom graphs show results for biased validation and test sets.

4.1 Stable Algorithms

The presample procedure was tested on linear regression, k-Nearest Neighbors (k-NN), and Support Vector Machines (SVM), which have previously been shown to be stable Bousquet & Elisseeff (2002); Devroye & Wagner (1979); Elisseeff (2000). The results strongly support our approach. None of the stable algorithms tested overfit the model selection process, allowing full use of the validation examples while maintaining negligible bias in model selection. We present results for SVM and k-NN on an image classification task. Linear regression results can be found in the Appendix.

Dataset: For all of the experiments in this section we used the MNIST dataset LeCun (1998), containing 70,000 28x28 grayscale images of handwritten digits drawn from 10 classes.

k-NN: We tested k-NN with the standard Euclidean metric. The dataset was divided into train, validation, and test sets, with 50000 examples used for training and 10000 examples used for selecting an optimal value of k, with values between 1 and 7. Figure 1 (top) shows accuracy results on the validation and test sets for different values of the sampling probability. Colors of the contour map range from black to blue to white, with black depicting lower accuracy scores and white depicting higher accuracies. The x-axis of the map shows different values of the sampling probability, and every point on the grid shows the result of an experiment with a given sampling probability and value of k. Test results show a monotone increase in performance for increased values of the sampling probability. Model selection was found to be independent of the sampling probability, allowing for the use of the full validation set for training, with the optimal choice giving an increase in overall performance.

SVM: We used the radial basis function (RBF) kernel. The dataset was divided into train, validation, and test sets, with the validation set used to choose an optimal value of the kernel parameter. Figure 2 (top) shows accuracy results on the validation and test sets for different values of the sampling probability. Each line on the test accuracy plot depicts accuracies for a different value of the kernel parameter. We noted no change in model selection for any value of the sampling probability. Test accuracies were only slightly improved, with a marginal increase in overall performance.

k-NN (Bias Experiment): An additional benefit of using validation data for training arises under dataset shift. Dataset shift is a common problem that occurs when the joint distribution of inputs and outputs differs between the training and test stages Quionero-Candela et al. (2009). Dataset shift is present in most practical applications, with examples including user queries in search engines, utterances input to dialogue systems, and product data in online stores. Such a drift in distribution may have calamitous effects on performance when using outdated data for training. Our sampling procedure, apart from effectively creating a larger dataset for training, can be used as an inductive bias on the training procedure for such time-dependent data. When using a validation set of this form, the latest, most recent data sample is usually used for testing. By sampling from the validation set, we can decrease the erroneous bias caused by this temporal drift.

We tested the effect of biased validation and test sets on our sampling procedure by introducing artificial bias to the validation and test sets. The bias comprised random flipping of the images (left-right and up-down) as well as an increased probability of class 9 (the training set distribution was set to the uniform distribution over all classes). Figure 1 (bottom) shows accuracy results using k-NN on the validation and test sets for different values of the sampling probability. Model selection was not affected by the sampling probability, allowing for the use of the full validation set for training, with an optimal choice yielding an increase in overall performance.
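A minimal sketch of how such a biased split could be constructed is given below; the flip probability and the class-9 boost factor are our own placeholder values, not the values used in the paper:

import numpy as np

def make_biased_set(X, y, target_class=9, boost=3.0, flip_prob=0.5, seed=0):
    # X is assumed to be an array of images shaped [n, height, width].
    rng = np.random.default_rng(seed)
    X = X.copy()
    # Random left-right and up-down flips.
    lr = rng.random(len(X)) < flip_prob
    ud = rng.random(len(X)) < flip_prob
    X[lr] = X[lr, :, ::-1]
    X[ud] = X[ud, ::-1, :]
    # Resample so that target_class is `boost` times more likely than each other class.
    weights = np.where(y == target_class, boost, 1.0)
    weights = weights / weights.sum()
    idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
    return X[idx], y[idx]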

SVM (Training Size Experiment): To examine the effect of the size of the training set, we reran the SVM experiment with a smaller dataset, half the original size. Figure 2 (bottom) shows accuracy results on the validation and test sets for the medium-sized training set. While our previous SVM results showed an indiscernible effect on test accuracy, validation sampling had an evident effect on performance with the reduced training set, with an increase in test accuracy.

Figure 2: Validation and test accuracies for the presample procedure using SVM and different choices of the kernel parameter. In the medium training set experiment (bottom), test accuracy increases for some values of the kernel parameter and does not change for others. In the large training set experiment (top), test accuracy does not change.

Summary:

All experiments showed an advantage to validation sampling with stable algorithms. Dataset shift bias and a reduced training set size both had a clear influence on sampling efficiency. In addition, none of the experiments performed overfit the model selection process. We suspect this is due to the strong stability of these algorithms in practice. SVM classified the data using only a small number of support vectors, so adding new examples had a negligible effect on test accuracy. For this reason, the effect of sampling was only evident when the training set was small enough. In the next section, we show that neural networks, which may have weaker stability properties, are more susceptible to overfitting the model selection process. Nevertheless, as we will see, our sampling procedure can be used to increase their performance significantly.

4.2 DenseNet

Unlike regression, SVM, and k-NN, neural networks have not yet been shown to be stable, yet they achieve state-of-the-art results in a variety of machine learning problems. In this section, we demonstrate our sampling procedure on Densely Connected Convolutional Networks (DenseNet) Huang et al. (2016). The DenseNet architecture contains dense shortcut connections between layers and was recently shown to achieve state-of-the-art results on CIFAR-10 and CIFAR-100.

Setup:

We used the CIFAR-10 dataset, which contains 32x32 color images drawn from 10 classes. The dataset of 60000 examples was divided into train, validation, and test sets, with 40000 examples used for training and 15000 examples used for model selection (the validation set). The data was normalized by its mean and standard deviation, and standard augmentation was applied to the training data by randomly mirroring, padding, and shifting the images.

All of our experiments used the vanilla DenseNet model with three dense blocks of 12 layers each and a growth rate of 12. The network was implemented using TensorFlow Abadi et al. (2016) with an SGD optimizer with momentum of 0.9 and weight decay. For model selection, we ran experiments on two parameters: the learning rate multiplier and the batch size. The initial learning rate was lowered twice at fixed epochs over the course of training. We tested 9 values of the learning rate multiplier and 4 batch sizes. When a parameter was not tested, it was set to its vanilla default value.

Figure 3: Validation and test accuracies for the batch-sample procedure, where each batch is drawn from the validation set with the sampling probability. Top graphs show different values of the learning rate multiplier, while bottom graphs show different batch sizes. Red circles show the optimal value of the learning rate multiplier / batch size for each value of the sampling probability. Orange circles show test accuracies for the models selected when these differ from the best model. The green line depicts the “knee” of the validation plot. Test performance scores to the left of the knee are higher with validation sampling than without it, in both the learning rate multiplier and batch size experiments. Model selection receives improper bias beyond a certain sampling probability. Validation scores increase as the sampling probability increases.

Results: The presample procedure (sampling from the validation set with the sampling probability before training starts) and the batch-sample procedure (sampling a validation batch with the sampling probability at each iteration) were tested, showing similar results. Figure 3 shows accuracies of the validation and test sets for varying values of the learning rate multiplier and batch size using the batch-sample procedure. Colors of the contour map range from black to blue to white, with black depicting lower accuracy scores and white depicting higher accuracies. The x-axis of the map shows different values of the sampling probability, and the y-axis depicts different values of the learning rate multiplier / batch size. Every point on the grid shows the result of an experiment with a given sampling probability and learning rate multiplier / batch size.

For each value of the sampling probability, the best choice of learning rate multiplier or batch size is marked by a circle. For the validation accuracy plots (left), the red circles mark the best model (which would be chosen). For the test accuracy plots (right), the red circles depict the (true) optimal model, while the orange circles depict the model that would have been selected (by the validation set). Thus, if for a certain value of the sampling probability the red and orange circles are not at the same value, overfitting of the model selection has occurred. For instance, in Figure 3, for small sampling probabilities the chosen model attains the best validation accuracy (top, left) and also the optimal test accuracy (top, right), whereas for large sampling probabilities the chosen model attains the best validation accuracy (top, left) but a sub-optimal test accuracy (top, right); the true optimal model in that case corresponds to a different learning rate multiplier with a higher test accuracy.

As expected, a monotone increase in validation accuracy occurred as the sampling probability increased (left). This is due to the increased number of validation examples trained upon. While validation accuracy had a linear dependence on the sampling probability for all of the stable algorithms tested, DenseNet showed a strongly concave behavior. Moreover, DenseNet admitted undeniable selection bias above a certain value of the sampling probability. Specifically, model selection received improper bias above a certain sampling probability in the learning rate multiplier experiment (top right) and in the batch size experiment (bottom right). Nevertheless, a valid range of the sampling probability always existed in which model selection was possible.

The sudden increase in validation accuracy for small values of the sampling probability reveals the necessity of the validation data. Meanwhile, small values of the sampling probability introduce negligible bias to the model selection process. This behaviour suggests choosing the sampling probability at an equilibrium point at which performance and model selection bias are balanced. Such a choice cannot rely on test accuracies (see “Data Dredging” in the Related Work Section). As a rule of thumb, we chose to use the “knee” of the validation plot for choosing a final value of the sampling probability. The knee of the validation plot represents the point at which overfitting of the validation examples has not yet occurred, while sufficient information for training has been used. We mark the knee of the validation plot in Figure 3 by a green line, dividing the graph into two parts: left of the knee, where model selection is possible, and right of the knee, where selection is prone to significant bias. Indeed, in all of our experiments, model selection before the knee did not overfit the model selection process. Test performance at the knee shows increased accuracies, in both the learning rate multiplier and batch size experiments, compared to no validation sampling.

Summary: There exists a clear advantage to sampling validation examples during training. While stable algorithms were remarkably robust in model selection (even when the full validation set was used), our experiments on DenseNet show a clear limit to the sampling magnitude. Nevertheless, both the presample and batch-sample procedures significantly improved results, with the batch-sample procedure having the advantage of effectively using more validation data.

5 Related Work

In this section, we discuss related methods for model selection and how our work is positioned relative to them. Table 1 contains a summary of these methods.

k-Fold Cross-Validation. In order to avoid overfitting in model selection, it is necessary to evaluate the models on a holdout validation set (not used in training). However, by partitioning the available data, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of train and validation sets. One way to overcome this is to use k-fold Cross-Validation Refaeilzadeh et al. (2009). In k-fold Cross-Validation, the dataset is partitioned into k equal-sized sets. Of the k sets, a single set is retained as the validation set, and the remaining sets are used as the training set. The results from the k folds are then averaged to produce a single estimate. Leave-one-out validation is the special case of Cross-Validation where each fold consists of a single instance. This type of validation is, of course, more expensive computationally.

While k-fold Cross-Validation reduces the bias of improper performance estimation, it is a computationally expensive process. Moreover, while k-fold Cross-Validation supposedly uses all of the data for training, each fold only trains on part of the data. Our sampling method can be used in conjunction with k-fold Cross-Validation, where in each fold we allow sampling of validation examples during training, as sketched below. This coupling can decrease bias while improving test performance. In fact, our sampling method already achieves the k-fold Cross-Validation objective of reducing the bias of improper performance estimation, and can therefore also be used as an alternative to k-fold Cross-Validation, with the benefit of increased test performance.
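A sketch of this combination is shown below (our own illustration); it uses scikit-learn's KFold for the splits and a build_model callable that returns a fresh estimator:

import numpy as np
from sklearn.model_selection import KFold

def kfold_with_validation_sampling(X, y, build_model, p=0.1, n_splits=5, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_va, y_va = X[val_idx], y[val_idx]
        # Sample validation-fold examples into the training fold with probability p.
        mask = rng.random(len(X_va)) < p
        X_aug = np.concatenate([X_tr, X_va[mask]])
        y_aug = np.concatenate([y_tr, y_va[mask]])
        model = build_model().fit(X_aug, y_aug)
        # Evaluate on the full validation fold, including the sampled examples.
        scores.append(model.score(X_va, y_va))
    return float(np.mean(scores))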

Bootstrapping. Bootstrapping Efron (1982) for model selection is a relatively simple technique in which properties of an estimator are estimated by sampling from an approximating distribution. Given a dataset of size n, a bootstrap sample is created by sampling n instances uniformly from the data (with replacement). The accuracy estimate is then derived by using the bootstrap sample for training and the remaining (out-of-bag) instances for testing, which in expectation leaves roughly 36.8% of the data for validation. It has been shown that the bootstrap procedure works well under stability assumptions on the algorithm used (i.e., the “bootstrap world” closely approximates the real world) Kohavi et al. (1995).
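For comparison, a minimal bootstrap sketch (our own illustration): training uses the in-bag indices, which may contain duplicates, and validation uses the out-of-bag indices:

import numpy as np

def bootstrap_split(n, seed=0):
    # Sample n instances with replacement (in-bag); validate on the rest (out-of-bag).
    rng = np.random.default_rng(seed)
    in_bag = rng.integers(0, n, size=n)
    out_of_bag = np.setdiff1d(np.arange(n), in_bag)  # roughly 36.8% of instances, in expectation
    return in_bag, out_of_bag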

While the bootstrapping technique effectively reuses the dataset examples, it maintains a strict segregation between train and validation examples. The enlarged training set is due to duplicate examples and does not use new information for training. Furthermore, the number of instances sampled in the bootstrap procedure is not interchangeable with the sampling probability of our method. While sampling more instances in the bootstrap method would effectively create a larger training set, the validation set's size would necessarily decrease. In contrast, our method does not reduce the size of the validation set. Nevertheless, one could conceive of a bootstrap procedure that would essentially achieve the same type of objective as ours. We leave this as a direction for future work.

Thresholdout. The practice of data analysis is an adaptive process. In practice, new hypotheses are proposed after seeing the results of previous ones, models are chosen and tuned on the basis of the obtained results, and datasets are shared and reused. This adaptive process can be harmful, in particular when subject to model selection Dwork et al. (2015). Thresholdout (the Reusable Holdout method) Dwork et al. (2017) is a method for coping with adaptivity by providing means of reliably verifying the results of an arbitrary adaptive data analysis process. As in standard analysis, data is split into a training set and a validation set. The algorithm ensures that the validation set maintains the essential guarantees of “fresh data” over the course of many estimation steps by accessing the validation set only through a suitable differentially private algorithm.

Thresholdout is an important tool for increasing the reliability of validation data in an adaptive process. Our method, on the other hand, considers a non-adaptive setting where a model is selected in a single stage. It can thereby be integrated with Thresholdout to ensure the validity of the model selection in an adaptive manner while using part of the validation data in each step of the training procedure.

Train after Test. Practitioners often follow a common approach of using pure validation data for model selection and then training on the full dataset (including the validation set examples) once a model has been chosen. Such a process is expected to introduce an optimistic bias into the performance estimates Cawley & Talbot (2010). For example, when minimizing the cross-validation performance, if the cross-validation statistic has a non-zero variance there is the possibility of over-fitting the model selection criterion. Figure 3 depicts an example in which this happens when selecting an optimal learning rate multiplier in the DenseNet architecture.

In order to avoid bias in performance estimates, one should consider penalizing the validation statistic. By sampling from the validation set, we allow the use of validation examples while effectively regularizing this effect. In addition, our method exposes a continuous range of sampling probabilities, which allows for a continuous adjustment of this inherent trade-off.

Data Dredging. “Selective reporting” (p-hacking, data dredging) occurs when researchers try out several statistical hypotheses and then selectively report those that produce significant results Nuzzo (2014); Head et al. (2015). Some examples include dropping outliers post-analysis, reporting only those variables which produce significant results, presenting results on specific data which yield significant results, and adapting the analysis to improve test results. More related to our work is an approach in which the training set size is increased until the best result is reached.

Our method departs from these practices in two respects. First, our method is not adaptive. That is, we do not use previous validation selections to select further models. Second, we do not use test performance scores in our selection process. Our choice of the sampling probability relies solely on validation scores, with the value chosen in the knee region of the validation plot (see Figure 3).

6 Conclusion

In this paper, we demonstrated the effectiveness of using the validation set for training. While the immoderate use of the validation set for training may potentially introduce a severe form of selection bias, our sampling procedure allows for a continuous, controlled balance between performance and model selection bias. Furthermore, we have shown that stable algorithms are less prone to overfitting the model selection process, with experiments showing that even aggressive sampling does not bias selection. We provide a per-batch sampling procedure for online algorithms and offer a simple heuristic for choosing the sampling probability at the knee of the validation plot, a point at which model selection bias and performance are balanced. Finally, our results on DenseNet suggest that our sampling procedure can be used in state-of-the-art deep networks, achieving significantly improved performance.

References

  • Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
  • Altman (1992) Altman, Naomi S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
  • Bousquet & Elisseeff (2002) Bousquet, Olivier and Elisseeff, André. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
  • Cawley & Talbot (2010) Cawley, Gavin C and Talbot, Nicola LC. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107, 2010.
  • Devroye & Wagner (1979) Devroye, Luc and Wagner, Terry. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979.
  • Dwork et al. (2015) Dwork, Cynthia, Feldman, Vitaly, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Roth, Aaron. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
  • Dwork et al. (2017) Dwork, Cynthia, Feldman, Vitaly, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Roth, Aaron. Guilt-free data reuse. Communications of the ACM, 60(4):86–93, 2017.
  • Elisseeff (2000) Elisseeff, André. A study about algorithmic stability and their relation to generalization performances. 2000.
  • Efron (1982) Efron, Bradley. The jackknife, the bootstrap, and other resampling plans, volume 38. SIAM, 1982.
  • Elisseeff et al. (2003) Elisseeff, André, Pontil, Massimiliano, et al. Leave-one-out error and stability of learning algorithms with applications. NATO science series sub series iii computer and systems sciences, 190:111–130, 2003.
  • Freedman & Freedman (1983) Freedman, David A and Freedman, David A. A note on screening regression equations. The American Statistician, 37(2):152–155, 1983.
  • Geisser (1975) Geisser, Seymour. The predictive sample reuse method with applications. Journal of the American statistical Association, 70(350):320–328, 1975.
  • Head et al. (2015) Head, Megan L, Holman, Luke, Lanfear, Rob, Kahn, Andrew T, and Jennions, Michael D. The extent and consequences of p-hacking in science. PLoS biology, 13(3):e1002106, 2015.
  • Hearst et al. (1998) Hearst, Marti A., Dumais, Susan T, Osuna, Edgar, Platt, John, and Scholkopf, Bernhard. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
  • Huang et al. (2016) Huang, Gao, Liu, Zhuang, Weinberger, Kilian Q, and van der Maaten, Laurens. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
  • Kohavi et al. (1995) Kohavi, Ron et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, volume 14, pp. 1137–1145. Montreal, Canada, 1995.
  • Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
  • LeCun (1998) LeCun, Yann. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Mukherjee et al. (2006) Mukherjee, Sayan, Niyogi, Partha, Poggio, Tomaso, and Rifkin, Ryan. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006.
  • Nuzzo (2014) Nuzzo, Regina. Scientific method: statistical errors. Nature News, 506(7487):150, 2014.
  • Quionero-Candela et al. (2009) Quionero-Candela, Joaquin, Sugiyama, Masashi, Schwaighofer, Anton, and Lawrence, Neil D. Dataset shift in machine learning. The MIT Press, 2009.
  • Refaeilzadeh et al. (2009) Refaeilzadeh, Payam, Tang, Lei, and Liu, Huan. Cross-validation. In Encyclopedia of database systems, pp. 532–538. Springer, 2009.
  • Schaffer (1993) Schaffer, Cullen. Selecting a classification method by cross-validation. Machine Learning, 13(1):135–143, 1993.
  • Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, Shai and Ben-David, Shai. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Shalev-Shwartz et al. (2009) Shalev-Shwartz, Shai, Shamir, Ohad, Sridharan, Karthik, and Srebro, Nathan. Learnability and stability in the general learning setting. 2009.
  • Stone (1974) Stone, Mervyn. Cross-validatory choice and assessment of statistical predictions. Journal of the royal statistical society. Series B (Methodological), pp. 111–147, 1974.

7 Appendix

7.1 Proof of Main Theorem

Proof.

We start by proving two standard technical lemmas.

Lemma 1.

Suppose OAROS stability holds with . Then

Proof.

We have that

Then it is enough to show that

then we are done by definition of OAROS stability and Jensen’s inequality.
Denote . Since and are drawn i.i.d from we have

On the other hand,

The proof is complete since . ∎

Lemma 2.

Suppose OAROS Stability holds with and also

(4)

Then

Proof.

We have that

In the first inequality we used the result of Lemma 1, and in the final inequality we used OAROS stability and the assumption in Equation (4). ∎

Continuing with the proof of the theorem, we restate the theorem for the ERM from the Preliminaries Section:

Theorem 2.

Let a learning algorithm be OAROS stable with a given rate; then the expected difference between its risk and its empirical risk is at most that rate.

We continue with the proof of the theorem. By Equation (1) it holds that

Then, using Theorem 2,

We now have that

By OAROS Stability this means

Then, by Lemma 2

and the proof is complete. ∎

7.2 Validation Stability Bound

Suppose our training set and validation set are both sampled from the same distribution. To show that we can sample from the validation set without over-fitting, we want to show that the following holds with high probability:

(5)

Then, if the validation loss does not change much, the ordering by which we pick the best model will not change either.
We consider a weaker formulation of Equation (5):

(6)

Let us see that Equation (5) follows from (6) by Markov's inequality. From (6) and Jensen's inequality, the expectation of the quantity of interest is bounded. Recall that, by Markov's inequality, a nonnegative random variable with bounded expectation exceeds any larger threshold only with correspondingly small probability; this gives us (5).
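Written out in our notation, the Markov step is: for a nonnegative random variable X with \mathbb{E}[X] \le \varepsilon and any \delta \in (0, 1),

\[
\Pr\!\left[X \ge \frac{\varepsilon}{\delta}\right] \le \frac{\mathbb{E}[X]}{\varepsilon / \delta} \le \delta,
\]

so with probability at least 1 - \delta we have X < \varepsilon / \delta.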

7.3 Linear Regression Results

We fit a 1-dimensional cubic function using noisy sampled examples and polynomial features. The dataset of 15 sampled points in total was divided into 10 examples for training and 5 examples for validation (sampled from the noisy distribution). The validation set was used to select the optimal order of the polynomial to fit, with values 1, 2, and 3. Similar to the k-NN experiment, we tested the effect of biased validation and test sets on our sampling procedure by introducing artificial bias to the validation and test sets separately. The biased functions are drawn in Figure 4.

Figure 4: Plot shows the function used for regression. The black line shows the function used for sampling the training set. Red lines show the biased functions used for creating the biased validation and test sets.
Figure 5: Validation and test accuracies for the presample procedure using linear regression. Top graphs show results for unbiased validation and test sets, while bottom graphs show results for biased validation and test sets.

Figure 5 shows mean error rates on the validation and test sets for different choices of polynomial degree. Colors of the contour map range from black to blue to white, with black depicting lower scores and white depicting the highest score. The x-axis of the map shows different values of the sampling probability, and every point on the grid shows the result of an experiment with a given sampling probability and polynomial degree. The plots show that an increase in the sampling probability brought a uniform decrease in validation error, thereby allowing for a correct choice of model (even when the whole validation set was used for training). Error rates decrease monotonically with the sampling probability. The gains are most significant in the case of biased validation and test sets, where sampling is essential to the training process.