Whether for sales prediction, fraud detection or treatment recommendation, machine learning (ML) models are widely used in diverse application contexts. Just as there is a plethora of application problems, there is an evergrowing variety of ML algorithms aiming to provide best possible solutions. Thus, the main issue of applying ML is identifying the algorithm which performs best at the task at hand. However, as the famous no free lunch theorem (wolpert) predicates, no single algorithm performs best at all tasks. The problem of algorithm selection is aggrevated by the fact that most ML methods have a set of meta parameters (also called hyperparameters) whose optimal choice is problem-specific.
Usually problem-optimal choices for hyperparameters are derived from a hyperparameter tuning process aimed at finding parameter settings that minimize the generalization error, i.e., the expected loss on unknown data from the same data generating process (DGP). However, the true generalization error is unknown and can only be estimated, e.g., using resampling methods such as k-fold cross-validation or bootstrapping. Thus, minimizing the generalization error is restricted to minimizing an (unbiased) estimate of it. In theory, this poses a stochastic optimization problem which in practice is commonly approached through heuristically comparing a set of candidate settings from a pre-specified parameter search space. These settings can either be generated bygrid or random search. The candidate configuration which minimizes the resampling error (e.g., measured by the misclassification rate or MSE) then results as the „optimal ” setting. Within the context of stochastic optimization, the resampling error is equivalent to the stochastic target function and the resampling strategy ultimately describes a repeated evaluation in the same parameter.
A key advantage of the random search algorithm is the possibility of parallelizing model evaluations. However, single evaluations within the resampling can still lead to high computational efforts depending on the learner and the dimensionality of the dataset. As such, it would be desirable to stop the evaluation process for parameter settings whose subpar quality becomes already apparent after a few evaluation steps. An early stopping could prevent redundant computations and save potentially a lot of run time.
The idea of early stopping is also at the core of statistics sequential test theory. Contrary to regular statistical testing with a fixed sample size , sequential tests dictate a process in which the decision to reject or accept the hypothesis, or to continue sampling is determined at each sampling step anew. Therefore, sequential tests are especially useful when sampling is costly and being able to form a decision on as few observations as possible is desirable. Furthermore, since both, Type I and II error are controlled for, sequential tests allow for treating and equally important whereas regular statistical tests only allow for rejecting , but not deciding for it.
The aim of our work is, thus, to study the feasibility of employing sequential tests during hyperparameter tuning to save evaluations and computation time. One main requirement for sequential tests is the existence of a potentially endless DGP to generate new samples. This seems to be met in hyperparameter tuning, since the DGP is represented by the resampling which, using (pseudo) random numbers, can in principle, generate arbitrarily many training and test samples from the original dataset. Many sequential tests are also based on some parametric assumptions. However, in the context of hyperparameter tuning it is not evident what kind of parametric properties can be assumed for the error estimates. As such, this necessitates an analysis of resampling error distributions beforehand to potentially derive at least approximate parametric properties.
If this succeeds, the question then becomes whether a tuning algorithm based on such a sequential test would be able to compete with the regular random search and other efficient tuning algorithms. This leads to the following three research questions in our work:
Which distribution family can be assumed for resampling errors?
How can a sequential statistical test for hyperparameter tuning problems be constructed?
How does a hyperparameter tuning approach using a sequential statistical test perform compared to a normal random search?
In the next section, we will give a brief introduction to the general machine learning scheme and how hyperparameter tuning is incorporated, especially to fix some notation. In section three, we will look at the first research question and try to find a sound distribution family for resampling errors. Section four again is a theoretical chapter, in which a sequential test procedure is introduced, and a sequential random search using this test procedure is constructed. In section five we conduct some experiments in which we show, that our sequential random search achieves comparable results to a usual random search, however, using significantly less resampling iterations. The paper is wrapped up in section six.
2 Hyperparameter Optimization
The basic object of machine learning i modeling a functional context
between a vector ofvariables and a target variable . Since the true is mostly unknown, a (machine) learning method is used to determine an approximation that describes as ”good” as possible.
This is done in the case of supervised learning based on an annotated learning datasetconsisting of pairs of observations of type where are the influence variable values of the th observation from the learning dataset and is the corresponding (true) target variable value. The approximation is determined on the so-called training data consisting of pairs of observations of type . The goodness of
is assessed using a loss function. A common choice of loss function in the regression case would be the quadratic (also called Gaussian) loss function (hastie, p. 18)
and in the classification case the 0-1 loss function (hastie, p. 221) with
The principal goal is to determine an approximation which minimizes the generalization error (the expected loss over all possible data), i.e. (cf. bischl, p. 251f.)
However, since the distributions of and are usually unknown in practice and only finitely many data points are available, is instead obtained by optimizing the estimated generalization error on a test data set consisting of pairs of observations , i.e.
Using the above loss functions, the estimation of the generalization error then yields the two performance measures MSE (Mean Squared Error) and MMCE (Mean Misclassification Error) with (cf. ISLR, pp. 29 u. 37):
Theoretically, it would be possible to use the same data to learn and to validate the model, i.e. , i.e. . However, such an approach is problematic in that the model is determined such that it explains the training data as well as possible. Thus, learning and validating on the same data leads to biased, optimistic error rates (see hastie, pp. 228ff.).
In general, it is desirable to predict unknown observations as well as possible with the model. Therefore, only observations that were not already part of the training data should be used for the validation of the model: . Usually, resampling methods are used to repeatedly generate training and test datasets and , from an original dataset with .
One common example is Bootstrapping, where is obtained by drawing observations with replacement from (i.e., some observations will be drawn more than once), and . For an overview of other resampling methods as well as application recommendations, see bischl. All resampling methods have in common that they calculate random partitions and . Hence, values of the loss functions whose calculation is based on such a partition are random numbers.
An additional complexity that often arises in determining an optimal model is that many learning methods have additional meta-parameters , so-called hyperparameters, in addition to the internally determined model parameters, which must be determined a priori and whose optimality is problem-specific. Thus, for a fixed the original optimization problem is expanded by the outer optimization problem (cf. bergstra, p. 282)
As before, this is a stochastic optimization problem whose solution, due to lack of knowledge of the distributions of and and finitely many data, can only be approximated by
This optimization problem is also called the hyperparameter tuning problem. All tuning algorithms solve this problem function in a similar way: Multiple values are proposed, the belonging value of the loss function is estimated using a resampling procedure and finally, the with the best estimated value of is used:
The optimization algorithms differ in how the is selected: Some algorithms propose only a single large batch (random search, grid search), some other algorithms perform complicated sequential procedures in order to propose a sequential list of that are as good as possible (racing, evolutionary and model-based procedures).
However, in the end, all these algorithms come down to the same basic operation: The performance of two settings has to be compared, and the better setting has to be determined. Usually, this is done in a rather simple way: The entire resampling procedure is performed for both and and the setting with the better value of is declared the winner. Hereby, the loss function is treated as if it is a deterministic function. This is achieved by using the same fixed partition and , for each setting .
Although this approach works pretty well in practice, from a theoretical view it is not 100% correct. Both and
are random variables, and we want to find out which of these two random variable is smaller (or larger, depending on the loss function). In statistics, this is usually reduced to the question, which of the two random variables has the smaller expected value. Since the mean value
is an estimator for the expected value, the usual approach is not entirely wrong. From a statistical point of view, however, a statistical test should be used with the null hypothesis
A classical statistical test procedure has two major disadvantages: First, it handles the two hypothesis a-symmetrically: It is only possible to reject and decide for . It is not possible to decide for . This disadvantage could be neglected by performing a two-sided test to find a difference, and the sign of the difference to decide for one of the hypothesis. In this case, however, there would be the possible outcome that two settings are indistinguishable with respect to , which is again not desireable. Actually, all of this is the reason why in usual tuning approaches, only the mean values are compared
The other disadvantage is the fixed number of observations needed. However, in at least some situations it may be possible after seeing just two observations to come to a decision. For example, consider and . Here, after seeing the first two observations, a significant difference between the vectors is obvious, and it is not needed to consider the remaining ones. (Of course, as always in statistics, there is a chance of being wrong.) Especially in the context of hyperparameter tuning, when every evaluation of is time consuming, since an entire model fit has to be performed, it is desired to use as less evaluations as possible. On the other hand, for two settings with pretty similar values of , it might be desirable to perform some extra evaluations, in order to be more sure about the made decision.
Both of these disadvantages are neglected by a sequential test: It can decide for both and , it will always make a decision in the end, and it will do so in a minimal amount of evaluations of . As always with statistical tests, it is both possible to construct a parametric and a non-parametric one, with the usual advantages and disadvantages. Here, we decided to construct a parametric test, hence, at first, we need to find a fitting parametric distribution family in the next chapter, before we can define a sequential test procedure in the chapter after.
3 Distribution of the loss function
In this paper, five classification and five regression problems are considered, as well as four different machine learning methods each. Table 1 gives a more detailed overview of the classification data sets and Table 2 of the regression data sets respectively. Most of the data sets are taken from openML, insurance from insurance, diamond from the R-package ggplot2 (ggplot) and wage from R-package ISLR (ISLR).
|dataset||class 0||class 1||discrete||numeric|
|insurance||1 338||3||3||1 122||13 270||63 770|
|diamond||2 700||3||6||326||4 038.40||18 788|
The learning methods used are classification and regression trees (rpart: rpart
), random forest (ranger: ranger
), XGBoost (xgboost: xgboost
), and logistic and linear regression with „elastic net” regularization (glmnet: glmnet). Table 3 lists the considered hyperparameters and corresponding search spaces for each learning method. Here, for rpart, the parameter maxdepth denotes the maximum allowed depth of the tree and cp denotes the complexity parameter, which specifies by how much a split must contribute to the improvement of the fit so that the corresponding subtree is not pruned. For the Random Forest (ranger), mtry describes the number of variables randomly drawn as split candidates, and sample.fraction and replace describe the fraction of observations used for each tree model and whether or not they are drawn with replacement. For xgboost, nrounds denotes the maximum number of iterations, eta the learning rate, and max_depth the maximum depth of the trees used. For glmnet, alpha regulates the mixture parameter of the Elastic Net regularization and lambda the degree of penalization. For the remaining hyperparameters of the individual methods, which are not subject of the optimization, the respective default settings are used.
All calculations and simulations in this work are performed in R (version 3.6.3; R2020). For any aspects of the machine learning process, i.e. model fitting, validation, tuning, etc., the R package mlr (mlr) is used, which provides a unified interface for running machine learning experiments in R.
Simulation of the error measures is performed as follows: For each combination of data set and machine learning method, ten random parameter settings are generated and validated using 1 000 resampling iterations, where Bootstraping is used as resampling strategy. The same data set partitions are used for each data set and learning procedure. The error measures considered are Mean Misclassification Error (MMCE) in the classification case and MSE in the regression case, hence, 1 000 observations of the value of the loss function are calculated for each setting.
In order to investigate the distribution family of the loss function, multiple usual distributions are fitted to the observations via Maximum-Likelihood estimation. Afterwards the Cramér-von-Mises criterion (CvM) is calculated in order to determine the goodness of the fit (see, e.g., stephens). This is done using the R package fitdistrplus (fitdist)
. The CvM-criterion is defined as the test statistics of the Cramér-von-Mises-test. It sorts the observation values in ascending order, evaluates them with distribution function and compares the values with values that are to be expected if the observations would actually follow that distribution:
The smaller is, the better the data is fitted by the distribution. Figure 1
shows the distribution of the CvM-criterion in some artificial situations: a normal distribution is fitted to some data that is actually normal, uniform or exponential distributed. For the normal distribution (i.e. the assumed distribution corresponds to the true distribution), the CvM is always smaller than 0.5, for the uniform distribution (which is kind of similar to the normal distribution) the values are between 1 and 2, for the exponential distribution (a quite different distribution), the values are between 6 and 8. Hence, for a good fitting distribution we would expect values smaller than 1, values up to 2 may be okay as well.
It cannot be expected to find a single distribution family that is always the best, contrary, it is to be expected that the distribution differs quite a lot for different settings. However, finding a distribution family that has a good fit in many situations seems sufficient in order to build a sequential test around it. Distribution families considered are the normal distribution, the gamma distribution, the beta distribution (for classification only), and the Weibull distribution, as well as variations of these in the form of the log normal distribution, the log gamma distribution, the inverse gamma distribution, and the inverse Weibull distribution.
A problem that arises when using many of these distributions is that, as in the case of the logarithmic and inverse distribution families, the respective carrier does not contain the value 0 at all or, as in the case of the gamma distribution, has a corresponding density value of 0. This is particularly problematic in the classification context, since errors of 0 are not uncommon for certain combinations of (sub)data set and learner. An obvious solution is to shift the data with an additive constant . However, it must be noted that the distributions in question are generally not ”invariant” to such shifts, i.e., the distribution family is usually not preserved. Therefore, the goodness of fit can sometimes depend strongly on the choice of , especially when the observed errors tend to be small, as is the case in classification, where the errors range between 0 and 1. To account for this, the fit of the distribution families is calculated for different values of , and only the results for the best are reported for each distribution family.
For the regression context, this problem is less relevant, since on the one hand MSE values of 0 only found in pathological examples and on the other hand MSE values usually have a higher magnitude compared to MMCE values and are thus less affected by small additive shifts.
Figure 2 shows the result for the classification, Figure 3 for the regression case. It turns out that especially the distributions of the Weibull type seem to be comparatively unsuitable for modeling both MMCE and MSE distributions. For all other distribution families, the CvM criterion takes values that are mostly smaller than 1, and in more than 75 even smaller than 0.5. This fit is by far not as good as fitting a normal distribution to actually normal distributed data, but it is far better than fitting a normal distribution to actually uniform distributed data (compare Figure 1). The one exception, not as bad as the Weibull distribution, but notably worse than the other distribution families is the normal distribution in the regression case. For the other distribution families, a relatively homogeneous picture emerges. All have a quite good (not perfect, but quite good) fit to the data and achieve comparable results. This result is confirmed when looking at other criteria than the CvM criterion (respective result plots are omitted here). Hence, from the view of the distribution fit, the choice of the distribution family is not crucial, as long as one does choose along the well performing distribution families.
4 Designing a Sequential Random Search
The main goal of our modified random search is reducing the evaluation steps needed while maintaining high performing solutions. As computing resampling errors can be viewed as a sequential process, we regard sequential statistical tests as a natural fit for this kind of situation.
For a general overview on sequential statistical tests see, e.g., ghosh. An essential class of sequential tests are Sequential Probability Ratio Tests
Sequential Probability Ratio Testsby wald. In its classic form, the test for is described by the following procedure:
Having sampled observation in step calculate the test statistic
where denotes the density function corresponding to . Then,
if , terminate and accept ,
if , terminate and accept ,
if , continue and sample a new observation
where define the so-called continuity region
of the test and are chosen such that the type I and type II errors are controlled for pre-specifiedand values. Our first simulation study indicated that mostly flexible multi-parameter distribution families are suitable for modelling the loss. We decide to use a log-normal distribution, since in comparison to the beta and gamma distribution families, the log-normal distribution offers the advantage of performing a location test based on only one parameter since the median of a log-normally distributed variable only depends on . A further advantage is the possibility to use test methods based on normality assumptions after a logarithmic transformation of the original data.
However, due to the presence of a nuisane parameter (), regular SPRTs can not be applied. Thus, we use a Sequential Likelihood Ratio Test (SLRT) proposed by ghosh for the sequential Behrens-Fisher-problem of two normally distributed i.i.d. random variables and where all parameters are unknown and one wants to test
with type I error rateand type II error rate . For this test, ghosh proposes the following continuity region:
where and denote the mean and and
the empirical variance of the respective sample until the-th observation. Since it holds that
for and , the test above is practically a test of the relative difference between the medians of the two loss distributions (assuming log-normality). However, this is only true in the regression case, since an additive shift of the data is required for classification tasks. In theory, one could also use a two-stage procedure instead of the SLRT, but this would require a pre-specified number of evaluations for the variance estimation alone. To keep the number of evaluation steps as small as possible, we therefore opt for the SLRT.
The general idea now is to combine the advantages of the regular random search algorithm with the benefits of early stopping from the SLRT. In each iteration, a single new random hyperparameter setting is proposed and used to estimate the corresponding value of the loss function using a resampling. However, instead of using a fixed number of, let us say, ten resampling iteration, the resampling is continued until a statistically sound decision can be made: The new setting is compared to the current best setting until a significant difference between those two settings is found, and the better setting is kept as the new best setting. Hereby, always the same amount of evaluations of the loss function are used. For the new setting, a new observation of the loss function has to be calculated every time using a resampling, for the current best setting prior evaluations are reused if available.
To prevent infinite run times, a maximum number of evaluations per setting should be defined and the test is forced to make a decision between the two settings if this maximum amount is reached, in case of doubt, the setting with the lower estimated loss function value is used. Pseudo-Code describing this Sequential Random Search procedure in detail can be found in Algorithm 1. Here, the function generateConfig() can be any function that generates a new hyperparameter setting, in the most simple case, uniform sampling in the search space can be used. The function generateConfig() evaluates the performance of a single hyperparameter setting, using an arbitrary resampling procedure, for example a single iteration of Bootstrapping As the termination criterion, usually a maximum number of iterations or computation time should be used.
For the implementation of generateConfig() two generally different approaches can be considered: One possibility is to create a resampling instance with different data set partition samples , at the beginning of the optimization, on which all configurations are evaluated from then on, i.e. in the -th evaluation step all configurations are validated on the same bootstrap sample. This allows immediate comparability of configurations (on the single sample) and represents the usual approach to regular random search. However, this approach does not do justice to the stochastic nature of the optimization problem and therefore the second approach is to always draw a new bootstrap sample for each evaluation of a configuration, i.e., two configurations are validated on different bootstrap samples in the -th step. In both cases, performance values already obtained for opt.config() should be reused to save additional evaluations.
5 Simulation Study
We will now compare the SQRS with a regular random search in an simulation study designed to ensure maximum comparability between the two algorithms. Using the same learners, parameter search spaces and datasets as in Tables 1 2 and 3, we generate 50 random configurations of hyperparameters for each combination of learner and dataset and validate their performance using a bootstrap resampling with ten iterations. To make the resampling results comparable, we use the same resampling instance for both algorithms. We perform 100 replications where the tuning process is carried out in a sequential manner, i.e. without parallelizing computations.
For the random search, we now simply look for the parameter setting with the minimal MMCE and MSE respectively and use this setting as the random search result. The sequential random search will now operate on exactly the same data: generateConfig() will not generate new random hyperparameter setting, but it will return the same 50 random configuration as used by the random search, but in a random order. When comparing the current best and the new configuration, SQRS uses exactly the same dataset partitions as random search, and, hence, the values of the loss function calculated during the random search are reused. The only real difference to the random search is, that the SQRS will not always use all ten iterations of the resampling, but it is allowed to discard a candidate setting as soon as the sequential tests shows that it is worse than the current best setting.
For the parameters of the SQRS, we use an additive shift of in classification and in regression tasks. As specified by the resampling instance, the maximum number of evaluations is . We vary the choice for parameters and between and , and for between and .
We aim to answer two question: First, can the SQRS algorithm achieve a comparable accuracy, i.e., does it find the same optimal hyperparameter setting? And second, how much less function evaluation does it need to do so?
First, we investigate how often the sequential random search leads to the same result as the random search (RS). Table 5 presents the proportion of identical solutions for the different considered values of , , and . In the classification case, the SQRS chooses the same parameter configuration approximately between 79% and 90% of the time. In the regression case, it does so 87% to 90% of the time. The more conservative the test becomes in terms of smaller (absolute) values for each test parameter, the more often the SQRS makes the same decision. It seems that the procedure tends to react more sensitively to changes in the test parameters chosen here in the classification case compared to the regression case.
If we consider the cases in which the SQRS does not find the optimal solutions, the solutions found by the SQRS do not perform considerably worse. In Figure 4 the rations of the performance of random search and the performance of SQRS are shown. A ratio of 1 corresponds to identical performances, ratios larger than 1 indicate a better performance of random search. Per definition of our experiment, SQRS cannot find a better setting than random search, hence, rations smaller than 1 are not possible here. For a better clarity, we omit all datapoints with a ratio of 1 in the plot. It can be seen that both in classification and regression in most experiments the solution found by SQRS is only slightly worse than the solution of random search.
Figure 5 (classification) and Figure 6 (regression) show the number of evaluations SQRS needed to reach its top performance compared to the total number of random search iterations. For classification in setting A, the quotient of the required evaluations for SQRS and RS is 0.476 at the median, 0.558 in Setting B, 0.608 in Setting C, and 0.712 in Setting D. In the most liberal case (A), sequential random search required fewer evaluations than random search 99.9% of the time and in 55.6% of the cases it requires at most half as many evaluations. Remember, in this case, 80% of the runs reached the same performance as with random search was reached, and and most other runs were only 10% worse at most. In the most conservative setting (D) the SQRS still requires fewer evaluations than the maximum possible in 90.8% of the cases, and in 24.5% of the cases, the maximum number of evaluations possible is half of the possible evaluations are performed.
In the regression settings, the results are even more promising. The quotient of the required evaluations for the sequential and the ordinary median is 0.324 for Setting A, 0.360 for Setting B, 0.388 for Setting C, and 0.456 for Setting D. In all scenarios, the SQRS never exhausted the maximum number of the maximum evaluation number. In the liberal setting (E) in 71.1% and in the conservative setting (H) in 55.3% of the cases, at most half as many evaluations were than for the random search. Remember that here in all setting in around 90% of the runs the same performance as for random search was reached.
In this work, we analysed the feasibility of employing sequential statistical tests during the hyperparameter tuning process to save computational effort and time. We aimed at answering three main research questions. The first research question asked which distribution family could best be used to approximate the distribution of resampling errors.
We answered this question with a small simulation study: We picked multiple different datasets as well as different machine learning methods, both from the fields of classification and regression. We empirically calculated the distribution of the resampling errors in these situations and fitted multiple different distribution families. Overall, typical flexible distribution families achieved comparably good fits: The gamma, the inverse gamma, the log gamma and the log normal distribution. We recognize that this approach is purely empirical and requires a more thorough analysis but it provided us with a first indication to continue our work upon.
Our second research question pertained to the construction of a tuning algorithm using a sequential test. The results to the first research question indicated that multiple distribution families seemed fitting. For practical reasons, we decided for the log-normal family. Using a sequential test by ghosh, we implemented a sequential variant of the random search (abbreviated SQRS). The main difference between the SQRS and regular random search was the early stopping possibility of the former. After each evaluation step, the resampling error samples of two configurations were compared using the sequential test procedure. When a terminating decision could not be made before reaching the maximum number of permitted evaluation steps, the procedure selected the parameter setting leading to the lower resampling error.
Having implemented the SQRS, our focus then turned to our third and final research question of how a hyperparameter tuning approach using a sequential statistical test would perform compared to a normal random search. To study this, we performed a simple experiment in which we used a regular random search for some hyperparameter optimization problems. Afterwards, the SQRS had to tackle the same problems under identical conditions. We found that by using the sequential testing procedure instead of doing a full resampling for each comparision, the SQRS was able to speed up the procedure quite a lot without leading to a considerable performance loss.
This indicates to us that the inclusion of a sequential statistical test procedure within a hyperparameter tuning algorithm is a promising approach: The reached best performance of our SQRS was comparable to a normal random search in nearly all cases, but the number of resampling iterations to reach this performance was halved. However, our comparison was targeted at maximum comparability and both algorithms were forced to operate under the same laboratory conditions using an identical set of candidate settings. What is missing are real application experiments in which the actual run time advantage of SQRS is analyzed. Since such experiments are quite time consuming, they have been omitted in this first iteration.
Moreover, even if random search is still used for many hyperparameter optimization problems, at least some more advanced and better tuning algorithms do exist. All of them rely on their ability to compare two hyperparameter setting using a resampling, so our sequential testing procedure could be incorporated into most of those algorithms. There are also other algorithms that try to reduce the number of resampling iterations needed by eliminating bad settings early, FRacing has to be mentioned in the first place here. These algorithms follow a different (statistically less sound) way of eliminating bad settings, and it still has to be analyzed whether the additional theoretical work of SQRS is worth it.