1 Introduction
Machine learning (ML) algorithms have been widely used in many application domains, including advertising, recommendation systems, computer vision, natural language processing, and user behavior analytics MLsc . This is because they are generic and demonstrate high performance in data analytics problems. Different ML algorithms are suitable for different types of problems or datasets SOAML . In general, building an effective machine learning model is a complex and time-consuming process that involves determining the appropriate algorithm and obtaining an optimal model architecture by tuning its hyperparameters (HPs) AMLSC . Two types of parameters exist in machine learning models: those that can be initialized and updated through the data learning process (e.g., the weights of neurons in neural networks), named model parameters; and those that cannot be directly estimated from data learning and must be set before training a ML model because they define the architecture of the model, named hyperparameters
2Ps . Hyperparameters are the parameters used either to configure a ML model (e.g., the penalty parameter $C$ in a support vector machine, and the learning rate used to train a neural network) or to specify the algorithm used to minimize the loss function (e.g., the activation function and optimizer types in a neural network, and the kernel type in a support vector machine)
parameters . To build an optimal ML model, a range of possibilities must be explored. The process of designing the ideal model architecture with an optimal hyperparameter configuration is named hyperparameter tuning. Tuning hyperparameters is considered a key component of building an effective ML model, especially for tree-based ML models and deep neural networks, which have many hyperparameters AMLB . The hyperparameter tuning process differs among ML algorithms due to their different types of hyperparameters, including categorical, discrete, and continuous hyperparameters EHPO . Manual testing is a traditional way to tune hyperparameters and is still prevalent in graduate student research, although it requires a deep understanding of the ML algorithms and their hyperparameter value settings ADL . However, manual tuning is ineffective for many problems due to several factors, including a large number of hyperparameters, complex models, time-consuming model evaluations, and non-linear hyperparameter interactions. These factors have inspired increased research into techniques for the automatic optimization of hyperparameters, so-called hyperparameter optimization (HPO) BBHPO . The main aim of HPO is to automate the hyperparameter tuning process and make it possible for users to apply machine learning models to practical problems effectively AMLSC . The optimal model architecture of a ML model is expected to be obtained after a HPO process. Some important reasons for applying HPO techniques to ML models are as follows AMLB :

It reduces the human effort required, since many ML developers spend considerable time tuning the hyperparameters, especially for large datasets or complex ML algorithms with a large number of hyperparameters.

It improves the performance of ML models. Many ML hyperparameters have different optimums to achieve optimal performance in different datasets or problems.

It makes the models and research more reproducible. Different ML algorithms can only be compared fairly when the same level of hyperparameter tuning is applied; hence, using the same HPO method on different ML algorithms also helps to determine the most suitable ML model for a specific problem.
It is crucial to select an appropriate optimization technique to detect optimal hyperparameters. Traditional optimization techniques may be unsuitable for HPO problems, since many HPO problems are not convex or differentiable, and such techniques may then return a local instead of a global optimum ASHPO . Gradient descent-based methods are a common type of traditional optimization algorithm that can be used to tune continuous hyperparameters by calculating their gradients GBad . For example, the learning rate of a neural network can be optimized by a gradient-based method.
Compared with traditional optimization methods like gradient descent, many other optimization techniques are more suitable for HPO problems, including decision-theoretic approaches, Bayesian optimization models, multi-fidelity optimization techniques, and metaheuristic algorithms EHPO . Apart from detecting continuous hyperparameters, many of these algorithms also have the capacity to effectively identify discrete, categorical, and conditional hyperparameters.
Decision-theoretic methods are based on the concept of defining a hyperparameter search space, evaluating hyperparameter combinations in this search space, and ultimately selecting the best-performing combination. Grid search (GS) AHPO is a decision-theoretic approach that exhaustively searches a fixed domain of hyperparameter values. Random search (RS) RS is another decision-theoretic method that randomly selects hyperparameter combinations from the search space, given limited execution time and resources. In GS and RS, each hyperparameter configuration is evaluated independently.
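As a minimal illustration (using a toy objective in place of a real model's validation error, so all values here are hypothetical), GS and RS can be sketched as:

```python
import itertools, random

# Toy objective standing in for a model's validation error (hypothetical;
# a real HPO run would retrain and validate the model for each configuration).
def validation_error(lr, reg):
    return (lr - 0.1) ** 2 + (reg - 1.0) ** 2

# Grid search: exhaustively evaluate every combination in a fixed grid.
lr_grid = [0.001, 0.01, 0.1, 1.0]
reg_grid = [0.1, 1.0, 10.0]
gs_best = min(itertools.product(lr_grid, reg_grid),
              key=lambda c: validation_error(*c))

# Random search: sample the same number of configurations at random
# (log-uniform sampling, a common choice for scale hyperparameters).
random.seed(0)
rs_candidates = [(10 ** random.uniform(-3, 0), 10 ** random.uniform(-1, 1))
                 for _ in range(len(lr_grid) * len(reg_grid))]
rs_best = min(rs_candidates, key=lambda c: validation_error(*c))

print("grid search best:", gs_best)
print("random search best:", rs_best)
```

Note that both methods evaluate each configuration independently, which also makes them trivially parallelizable.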
Unlike GS and RS, Bayesian optimization (BO) BOHP models determine the next hyperparameter value based on the results of previously tested hyperparameter values, which avoids many unnecessary evaluations; thus, BO can detect the best hyperparameter combination within fewer iterations than GS and RS. To be applicable to different problems, BO can model the distribution of the objective function using different surrogate models, including Gaussian process (GP), random forest (RF), and tree-structured Parzen estimator (TPE) models surrogates . BO-RF and BO-TPE can retain the conditionality of variables surrogates ; thus, they can be used to optimize conditional hyperparameters, like the kernel type and the penalty parameter $C$ in a support vector machine (SVM). However, since BO models work sequentially to balance the exploration of unexplored areas and the exploitation of currently-tested regions, it is difficult to parallelize them.

Training a ML model often takes considerable time and space. Multi-fidelity optimization algorithms have been developed to tackle problems with limited resources, the most common being bandit-based algorithms. Hyperband Hyperband is a popular bandit-based optimization technique that can be considered an improved version of RS. It generates small versions of datasets and allocates the same budget to each hyperparameter combination. In each iteration of Hyperband, poorly-performing hyperparameter configurations are eliminated to save time and resources.
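The elimination idea at the core of Hyperband can be sketched as a successive-halving loop on a toy objective (the evaluation function and budget schedule below are illustrative assumptions, not the full Hyperband algorithm, which additionally varies the initial number of configurations):

```python
import random

random.seed(42)

# Hypothetical stand-in for training a model for `budget` units of
# resource and returning a validation loss: more budget -> less noise.
def evaluate(config, budget):
    true_loss = (config - 0.3) ** 2
    noise = random.gauss(0, 0.5 / budget)
    return true_loss + noise

# Successive halving: start many configurations on a small budget,
# keep the best-performing half, and double the budget each round.
configs = [random.uniform(0, 1) for _ in range(16)]
budget = 1
while len(configs) > 1:
    scores = {c: evaluate(c, budget) for c in configs}
    configs = sorted(configs, key=scores.get)[:len(configs) // 2]
    budget *= 2
print("selected configuration:", configs[0])
```

Poor configurations are thus discarded after only cheap, low-fidelity evaluations, and the full budget is spent only on the surviving candidates.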
Metaheuristic algorithms are a set of techniques used to solve complex, large-search-space, and non-convex optimization problems, to which HPO problems belong HumanAML . Among all metaheuristic methods, the genetic algorithm (GA) GA2 and particle swarm optimization (PSO) PSODL are the two most prevalent metaheuristics used for HPO problems. Genetic algorithms detect well-performing hyperparameter combinations in each generation and pass them on to the next generation to identify the best-performing combination. In PSO algorithms, each particle communicates with the other particles to detect and update the current global optimum in each iteration until the final optimum is detected. Metaheuristics can efficiently explore the search space to detect optimal or near-optimal solutions; hence, they are particularly suitable for HPO problems with large configuration spaces due to their high efficiency. For instance, a deep neural network (DNN) often has a large configuration space with multiple hyperparameters, including the activation and optimizer types, the learning rate, the dropout rate, etc.
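A minimal PSO sketch on a toy one-dimensional objective (the inertia and acceleration coefficients below are common illustrative choices, not values prescribed by the cited works):

```python
import random

random.seed(1)

# Toy 1-D objective standing in for a model's validation loss (hypothetical).
def loss(x):
    return (x - 2.0) ** 2

# A minimal particle swarm: each particle tracks its personal best position,
# and all particles are pulled toward the swarm's global best.
n_particles, n_iters, w, c1, c2 = 10, 50, 0.7, 1.5, 1.5
pos = [random.uniform(-10, 10) for _ in range(n_particles)]
vel = [0.0] * n_particles
pbest = pos[:]                      # personal best positions
gbest = min(pos, key=loss)          # global best position

for _ in range(n_iters):
    for i in range(n_particles):
        vel[i] = (w * vel[i]
                  + c1 * random.random() * (pbest[i] - pos[i])
                  + c2 * random.random() * (gbest - pos[i]))
        pos[i] += vel[i]
        if loss(pos[i]) < loss(pbest[i]):
            pbest[i] = pos[i]
    gbest = min(pbest, key=loss)

print("global best:", gbest)
```

Unlike BO, each iteration evaluates all particles, so the inner loop parallelizes naturally across a swarm.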
Although using HPO algorithms to tune the hyperparameters of ML models greatly improves model performance, certain other aspects, like their computational complexity, still have much room for improvement. On the other hand, since different HPO models have their own advantages and suitable problem types, an overview of them is necessary for proper optimization algorithm selection across different types of ML models and problems.
This paper makes the following contributions:

It reviews common ML algorithms and their important hyperparameters.

It analyzes common HPO techniques, including their benefits and drawbacks, to help apply them to different ML models by appropriate algorithm selection in practical problems.

It surveys common HPO libraries and frameworks for practical use.

It discusses the open challenges and research directions of the HPO research domain.
In this survey paper, we begin with a comprehensive introduction to the common optimization techniques used in ML hyperparameter tuning problems. Section 2 introduces the main concepts of mathematical optimization and hyperparameter optimization, as well as the general HPO process. In Section 3, we discuss the key hyperparameters of common ML models that need to be tuned. Section 4 covers the various state-of-the-art optimization approaches that have been proposed for tackling HPO problems. In Section 5, we analyze different HPO methods and discuss how they can be applied to ML algorithms. In Section 6, we provide an introduction to various public libraries and frameworks that have been developed to implement HPO. Section 7 presents and discusses the experimental results of using HPO on benchmark datasets for HPO method comparison and practical use case demonstration. In Section 8, we discuss several research directions and open challenges that should be considered to improve current HPO models or develop new HPO approaches. We conclude the paper in Section 9.
2 Mathematical Optimization and Hyperparameter Optimization Problems
The key process of machine learning is solving optimization problems. To build a ML model, its weight parameters are initialized and then optimized by an optimization method until the objective function approaches a minimum value and the accuracy approaches a maximum value mathopt1 . Similarly, hyperparameter optimization methods aim to optimize the architecture of a ML model by identifying the optimal hyperparameter configurations. In this section, the main concepts of mathematical optimization and hyperparameter optimization for machine learning models are discussed.
2.1 Mathematical Optimization
Mathematical optimization is the process of finding the best solution from a set of available candidates to maximize or minimize the objective function mathopt1 . Generally, optimization problems can be classified as constrained or unconstrained optimization problems, based on whether they place constraints on the decision variables (also called solution variables).
In unconstrained optimization problems, a decision variable, $x$, can take any value from the one-dimensional space of all real numbers, $\mathbb{R}$. An unconstrained optimization problem can be denoted by mathopt2 :

(1) $\min_{x \in \mathbb{R}} f(x)$

where $f(x)$ is the objective function.
On the other hand, most real-life optimization problems are constrained optimization problems. The decision variable for constrained optimization problems is subject to certain constraints, which can be mathematical equalities or inequalities. Therefore, constrained optimization problems, or general optimization problems, can be expressed as mathopt2 :

(2) $\min f(x)$

subject to $g_i(x) \le 0 \ (i = 1, \dots, m)$, $h_j(x) = 0 \ (j = 1, \dots, p)$, $x \in D$

where $g_i(x)$ are the inequality constraint functions; $h_j(x)$ are the equality constraint functions; and $D$ is the domain of $x$.
The role of constraints is to limit the possible values of the optimal solution to certain areas of the search space, named the feasible region mathopt2 . Thus, the feasible region $\mathcal{F}$ of $x$ can be represented by:

(3) $\mathcal{F} = \{x \in D \mid g_i(x) \le 0 \ (i = 1, \dots, m), \ h_j(x) = 0 \ (j = 1, \dots, p)\}$
To conclude, an optimization problem consists of three major components: a set of decision variables $x$, an objective function $f(x)$ to be either minimized or maximized, and a set of constraints that restrict the variables to certain ranges of values (if it is a constrained optimization problem). Therefore, the goal of optimization tasks is to obtain the set of variable values that minimizes or maximizes the objective function while satisfying any applicable constraints.
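As a small worked example of these three components (a toy objective and constraint, not an HPO problem), one simple way to handle a constraint is to project each gradient step back onto the feasible region:

```python
# Minimize f(x) = (x - 0.2)^2 subject to x >= 1 (illustrative values).
# The unconstrained optimum 0.2 is infeasible, so the constrained
# minimizer lies on the boundary of the feasible region.

def grad(x):                 # derivative of the objective f
    return 2 * (x - 0.2)

x, step = 3.0, 0.1
for _ in range(100):
    x -= step * grad(x)      # unconstrained gradient step
    x = max(x, 1.0)          # project back onto the feasible region x >= 1
print("constrained minimizer:", x)
```

This projected-gradient sketch shows how the constraint moves the solution away from the unconstrained optimum to the edge of the feasible region.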
Many HPO problems have certain constraints, like the feasible domain of the number of clusters in k-means, as well as time and space constraints. Therefore, constrained optimization techniques are widely used in HPO problems AMLSC .

For optimization problems, in many cases, only a local instead of a global optimum can be obtained. For example, to obtain the minimum of a problem, assuming $\mathcal{F}$ is the feasible region of a decision variable $x$, a global minimum is a point $x^*$ satisfying $f(x^*) \le f(x)$ for all $x \in \mathcal{F}$, while a local minimum $x^*$ is a point satisfying $f(x^*) \le f(x)$ only for all $x$ in a neighborhood $N$ of $x^*$ mathopt2 . Thus, a local optimum may only be an optimum within a small range instead of being the optimal solution over the entire feasible region.
A local optimum is only guaranteed to be the global optimum in convex functions convex . Convex functions are functions that have only one optimum; therefore, continuing to search along the direction in which the objective function decreases will detect the global optimal value. A function $f$ is a convex function if, for all $x_1, x_2 \in D$ and $\lambda \in [0, 1]$ convex :

(4) $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$

where $D$ is the domain of the decision variables, and $\lambda$ is a coefficient in the range of $[0, 1]$.
An optimization problem is a convex optimization problem only when the objective function is a convex function and the feasible region is a convex set, denoted by convex :

(5) $\min_{x \in \mathcal{F}} f(x)$, where $f$ is a convex function and $\mathcal{F}$ is a convex set.
On the other hand, non-convex functions have multiple local optima, of which only one is the global optimum. Most ML and HPO problems are non-convex optimization problems; thus, utilizing inappropriate optimization methods often detects only a local instead of a global optimum.
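The following toy example (an illustrative quartic function, not one from the cited references) shows gradient descent converging to either a local or the global minimum depending purely on its starting point:

```python
# A simple non-convex function with two minima: gradient descent ends
# up in whichever basin it starts in, illustrating local vs. global optima.
def f(x):
    return x ** 4 - 4 * x ** 2 + x

def grad(x):
    return 4 * x ** 3 - 8 * x + 1

def gradient_descent(x, step=0.01, iters=1000):
    for _ in range(iters):
        x -= step * grad(x)
    return x

x_left = gradient_descent(-2.0)    # basin of the global minimum (x < 0)
x_right = gradient_descent(2.0)    # basin of a local minimum only (x > 0)
print(x_left, x_right, f(x_left) < f(x_right))
```

Both runs satisfy the first-order condition (near-zero gradient), yet only one of them is the global optimum, which is exactly the failure mode described above.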
There are many traditional methods that can be used to solve optimization problems, including gradient descent, Newton's method, conjugate gradient, and heuristic optimization methods mathopt1 . Gradient descent is a commonly-used optimization method that follows the negative gradient direction to move towards the optimum. However, gradient descent is not guaranteed to detect the global optimum unless the objective function is convex. Newton's method uses the inverse of the Hessian matrix to obtain the optimum. Newton's method converges faster than gradient descent, but often requires more time and space than gradient descent to store and compute the Hessian matrix. The conjugate gradient method searches along conjugate directions constructed from the gradients of known data points to detect the optimum. Conjugate gradient also converges faster than gradient descent, but computing the conjugate directions is more complex. Unlike the other traditional methods, heuristic methods use empirical rules to solve optimization problems instead of following systematic steps to obtain the solution. Heuristic methods can often detect an approximate global optimum within a few iterations, but they cannot guarantee finding the global optimum mathopt1 .

2.2 Hyperparameter Optimization Problem Statement
During the design process of ML models, effectively searching the hyperparameters’ space using optimization techniques can identify the optimal hyperparameters for the models. The hyperparameter optimization process consists of four main components: an estimator (a regressor or a classifier) with its objective function, a search space (configuration space), a search or optimization method used to find hyperparameter combinations, and an evaluation function to compare the performance of different hyperparameter configurations.
The domain of a hyperparameter can be continuous (e.g., the learning rate), discrete (e.g., the number of clusters), binary (e.g., whether to use early stopping or not), or categorical (e.g., the type of optimizer). Accordingly, hyperparameters are classified as continuous, discrete, and categorical hyperparameters. For continuous and discrete hyperparameters, the domains are usually bounded in practical applications AHPO UBO . On the other hand, the hyperparameter configuration space sometimes contains conditionality: a hyperparameter may need to be used or tuned depending on the value of another hyperparameter, in which case it is called a conditional hyperparameter ASHPO . For instance, in SVM, the degree of the polynomial kernel function only needs to be tuned when the kernel type is chosen to be polynomial.
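A conditional configuration space can be sketched as a nested sampler (a hyperopt-style structure; the value ranges below are illustrative assumptions, while the parameter names mirror sklearn's SVC):

```python
import random

random.seed(0)

# Minimal sketch of a conditional search space for an SVM: 'gamma' and
# 'degree' exist only when the sampled kernel actually uses them.
def sample_svm_config():
    config = {"C": 10 ** random.uniform(-2, 2),                 # continuous
              "kernel": random.choice(["linear", "rbf", "poly"])}  # categorical
    if config["kernel"] in ("rbf", "poly"):
        config["gamma"] = 10 ** random.uniform(-3, 1)   # conditional on kernel
    if config["kernel"] == "poly":
        config["degree"] = random.randint(2, 5)         # conditional, discrete
    return config

for _ in range(3):
    print(sample_svm_config())
```

Optimizers that treat each dimension independently cannot represent this structure, which is why conditionality-aware methods (e.g., BO-TPE, as discussed in Section 1) are preferred for such spaces.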
In simple cases, all hyperparameters can take unrestricted real values, and the feasible set of hyperparameters is a real-valued $n$-dimensional vector space. However, in most cases, the hyperparameters of a ML model take values from different domains and have different constraints, so their optimization problems are often complex constrained optimization problems HPOpro . For instance, the number of considered features in a decision tree should be in the range of 0 to the total number of features, and the number of clusters in k-means should not be larger than the number of data points. Additionally, categorical hyperparameters can often only take certain values, like the limited choices of the activation function and the optimizer of a neural network. Therefore, the feasible domain of the hyperparameter configuration $x$ often has a complex structure, which increases the problem's complexity HPOpro .

In general, the aim of a hyperparameter optimization problem is to obtain PSODL :
(6) $x^* = \arg\min_{x \in \mathcal{X}} f(x)$

where $f(x)$ is the objective function to be minimized, such as the error rate or the root mean squared error (RMSE); $x^*$ is the hyperparameter configuration that produces the optimum value of $f(x)$; and a hyperparameter configuration $x$ can take any value in the search space $\mathcal{X}$.
The aim of HPO is to achieve optimal or near-optimal model performance by tuning hyperparameters within the given budgets AMLSC . The mathematical expression of $f(x)$ varies, depending on the objective function of the chosen ML algorithm and the chosen performance metric. Model performance can be evaluated by various metrics, like accuracy, RMSE, F1-score, and false alarm rate. On the other hand, in practice, time budgets are an essential constraint for HPO models and must be considered. It often requires a massive amount of time to optimize the objective function of a ML model over a reasonable number of hyperparameter configurations: every time a hyperparameter configuration is tested, the entire ML model needs to be retrained, and the validation set needs to be processed to generate a score that reflects the model performance.
After selecting a ML algorithm, the main process of HPO is as follows ASHPO :

Select the objective function and the performance metrics;

Select the hyperparameters that require tuning, summarize their types, and determine the appropriate optimization technique;

Train the ML model using the default hyperparameter configuration or common values as the baseline model;

Start the optimization process with a large search space as the hyperparameter feasible domain determined by manual testing and/or domain knowledge;

Narrow the search space based on the regions of currentlytested wellperforming hyperparameter values, or explore new search spaces if necessary.

Return the bestperforming hyperparameter configuration as the final solution.
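The steps above can be sketched as a coarse-to-fine random search on a toy objective (the objective function and budgets are illustrative stand-ins for retraining and validating a real model):

```python
import random

random.seed(0)

# Toy stand-in for retraining the model and measuring validation error.
def validation_error(lr):
    return (lr - 0.05) ** 2

baseline = validation_error(0.01)       # step 3: default configuration
low, high = 1e-4, 1.0                   # step 4: wide initial search space
best_lr, best_err = 0.01, baseline
for _ in range(3):                      # step 5: narrow around the best point
    for _ in range(20):
        lr = random.uniform(low, high)
        err = validation_error(lr)
        if err < best_err:
            best_lr, best_err = lr, err
    width = (high - low) / 4
    low, high = max(best_lr - width, 1e-4), best_lr + width
print("best learning rate:", best_lr)   # step 6: return the best configuration
```

Each narrowing stage reuses the currently-best region, mirroring the manual coarse-to-fine process described in the steps above.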
However, most traditional optimization techniques OML are unsuitable for HPO, since HPO problems are different from traditional optimization problems in the following aspects ASHPO :

The optimization target, the objective function of ML models, is usually a non-convex and non-differentiable function. Therefore, many traditional optimization methods designed to solve convex or differentiable optimization problems are often unsuitable for HPO problems, since these methods may return a local optimum instead of a global optimum. Additionally, an optimization target lacking smoothness makes certain traditional derivative-free optimization models perform poorly for HPO problems AUTOMS .

The hyperparameters of ML models include continuous, discrete, categorical, and conditional hyperparameters. Thus, many traditional numerical optimization methods NumericalO that only aim to tackle numerical or continuous variables are unsuitable for HPO problems.

It is often computationally expensive to train a ML model on a large-scale dataset. HPO techniques sometimes use data sampling to obtain approximate values of the objective function. Thus, effective optimization techniques for HPO problems should be able to use these approximate values. However, function evaluation time is often ignored in many black-box optimization (BBO) models, which often require exact instead of approximate objective function values. Consequently, many BBO algorithms are often unsuitable for HPO problems with limited time and resource budgets.
Therefore, appropriate optimization algorithms should be applied to HPO problems to identify optimal hyperparameter configurations for ML models.
3 Hyperparameters in Machine Learning Models
To boost ML models with HPO, the first step is to identify the key hyperparameters that need to be tuned to fit ML models to specific problems or datasets.
In general, ML models can be classified as supervised or unsupervised learning algorithms, based on whether they are built to model labeled or unlabeled datasets superAM . Supervised learning algorithms are a set of machine learning algorithms that map input features to a target by training on labeled data, and mainly include linear models, k-nearest neighbors (KNN), support vector machines (SVM), naïve Bayes (NB), decision-tree-based models, and deep learning (DL) algorithms supervised . Unsupervised learning algorithms are used to find patterns in unlabeled data and can be divided into clustering and dimensionality reduction algorithms based on their aims. Clustering methods mainly include k-means, density-based spatial clustering of applications with noise (DBSCAN), hierarchical clustering, and expectation-maximization (EM); two common dimensionality reduction algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA) sklearnbook . Moreover, there are several ensemble learning methods that combine different singular models to further improve model performance, like voting, bagging, and AdaBoost. In this paper, the important hyperparameters of common ML models are studied based on their names in Python libraries, including scikit-learn (sklearn) sklearn , XGBoost xgboost , and Keras keras .

3.1 Supervised Learning Algorithms
In supervised learning, both the input and the output are available, and the goal is to obtain an optimal predictive model function that minimizes the cost function modeling the error between the estimated output and the ground-truth labels. The predictive model function varies based on its model structure. With the possible model structures determined by different hyperparameter configurations, the domain of the ML model function is restricted to a set of functions $\mathcal{F}$. Thus, the optimal predictive model $f^*$ can be obtained by SL :

(7) $f^* = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$

where $n$ is the number of training data points, $x_i$ is the feature vector of the $i$-th instance, $y_i$ is the corresponding actual output, and $L(f(x_i), y_i)$ is the cost function value of each sample.
Many different loss functions exist in supervised learning algorithms, including the square of the Euclidean distance, cross-entropy, information gain, etc. SL . On the other hand, different ML algorithms generate different predictive model architectures based on different hyperparameter configurations, which will be discussed in detail in this subsection.
3.1.1 Linear Models
In general, supervised learning models can be classified as regression or classification techniques when used to predict continuous or discrete target variables, respectively. Linear regression ML is a typical regression model that predicts a target by the following equation:

(8) $\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_n x_n$

where the target value is expected to be a linear combination of the input features $x = (x_1, \dots, x_n)$, and $\hat{y}$ is the predicted value. The weight vector $w = (w_1, \dots, w_n)$ is designated as the attribute 'coef_', and the bias $w_0$ is defined as another attribute, 'intercept_', in the linear model of sklearn. Usually, no hyperparameter needs to be tuned in linear regression. A linear model's performance mainly depends on how well the problem or data follows a linear distribution.
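A brief sketch (assuming scikit-learn is installed; the synthetic data and the true coefficients are chosen only for illustration) of where sklearn stores these fitted parameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data following y = 3*x1 - 2*x2 + 1, so the fitted
# parameters should recover the true weights and bias exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1

model = LinearRegression().fit(X, y)
print(model.coef_)        # fitted weight vector w = (w_1, ..., w_n)
print(model.intercept_)   # fitted bias term w_0
```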
To improve on the original linear regression model, ridge regression was proposed in ridge . Ridge regression imposes a penalty on the size of the coefficients, and aims to minimize the objective function ridgelasso :

(9) $\min_{w} \|Xw - y\|_2^2 + \alpha \|w\|_2^2$

where $\|w\|_2^2$ is the $\ell_2$-norm of the coefficient vector, and $\alpha$ is the regularization strength. A larger value of $\alpha$ indicates a larger amount of shrinkage; thus, the coefficients are more robust to collinearity.
Lasso regression lasso is another linear model used to estimate sparse coefficients, consisting of a linear model with an added $\ell_1$ regularization term. It aims to minimize the objective function ridgelasso :

(10) $\min_{w} \frac{1}{2n} \|Xw - y\|_2^2 + \alpha \|w\|_1$

where $\alpha$ is a constant and $\|w\|_1$ is the $\ell_1$-norm of the coefficient vector. Therefore, the regularization strength $\alpha$ is a crucial hyperparameter of both ridge and lasso regression models.
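A short illustration of the regularization-strength hyperparameter, which sklearn exposes as 'alpha' (synthetic data; the alpha values are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Larger alpha shrinks the coefficients more strongly toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.1, size=50)

weak = Ridge(alpha=0.01).fit(X, y)     # mild shrinkage
strong = Ridge(alpha=100.0).fit(X, y)  # strong shrinkage
print(np.abs(weak.coef_).sum(), np.abs(strong.coef_).sum())
```

The same 'alpha' hyperparameter name is used by sklearn's Lasso estimator, where it controls the $\ell_1$ penalty instead.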
Logistic regression (LR) LoR is a linear model used for classification problems. In LR, the cost function may differ, depending on the regularization method chosen for the penalization. There are three main types of regularization methods in LR: $\ell_1$-norm, $\ell_2$-norm, and elastic-net regularization LRnorms .
Therefore, the first hyperparameter that needs to be tuned in LR is the regularization method used in the penalization, 'l1', 'l2', 'elasticnet', or 'none', which is called 'penalty' in sklearn. The coefficient 'C', the inverse of the regularization strength, is another essential hyperparameter of the model. In addition, the 'solver' type, representing the optimization algorithm type, can be set to 'newton-cg', 'lbfgs', 'liblinear', 'sag', or 'saga' in LR. The 'solver' type has correlations with 'penalty' and 'C', so they are conditional hyperparameters.
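A brief sketch of these hyperparameters in sklearn (the solver/penalty pairings shown are known-compatible combinations, e.g., 'liblinear' supports 'l1' while 'lbfgs' supports only 'l2'; the dataset is synthetic and used only for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 'penalty' is conditional on 'solver'; 'C' is the inverse
# regularization strength (smaller C = stronger regularization).
l1_model = LogisticRegression(penalty="l1", solver="liblinear",
                              C=1.0, max_iter=1000).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="lbfgs",
                              C=1.0, max_iter=1000).fit(X, y)
print(l1_model.score(X, y), l2_model.score(X, y))
```

Requesting an incompatible pairing (e.g., 'l1' with 'lbfgs') raises an error, which is precisely what makes these conditional hyperparameters.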
3.1.2 KNN
K-nearest neighbor (KNN) is a simple ML algorithm used to classify data points by calculating the distances between different data points. In KNN, the predicted class of each test sample is set to the class to which most of its k nearest neighbors in the training set belong.
Assuming a training set $T = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i$ is the feature vector of an instance and $y_i \in \{c_1, \dots, c_m\}$ is the class of the instance, $i = 1, \dots, n$, the class $y$ of a test instance $x$ can be denoted by KNN :

(11) $y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad j = 1, \dots, m$

where $I$ is an indicator function ($I = 1$ when $y_i = c_j$, otherwise $I = 0$), and $N_k(x)$ is the field involving the k nearest neighbors of $x$.
In KNN, the number of considered nearest neighbors, $k$, is the most crucial hyperparameter kinknn . If $k$ is too small, the model will overfit to noise in the training data; if $k$ is too large, the model will underfit and require high computation time. In addition, the weight function used in the prediction can be chosen to be 'uniform' (points are weighted equally) or 'distance' (points are weighted by the inverse of their distance), depending on the specific problem. The distance metric and the power parameter $p$ of the Minkowski metric can also be tuned, as they can result in minor improvements. Lastly, the 'algorithm' used to compute the nearest neighbors can be chosen from a ball tree, a k-dimensional (KD) tree, or a brute-force search. Typically, the model can determine the most appropriate algorithm itself by setting 'algorithm' to 'auto' in sklearn sklearn .
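The KNN hyperparameters named above can be tuned, for example, with a simple grid search (the candidate values below are illustrative choices, and the iris dataset is used only for demonstration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid over the two key hyperparameters: n_neighbors (k) and weights.
param_grid = {"n_neighbors": [1, 3, 5, 11, 21],
              "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Cross-validation here serves as the evaluation function of the HPO process described in Section 2.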
3.1.3 SVM
A support vector machine (SVM) SVM1 is a supervised learning algorithm that can be used for both classification and regression problems. SVM algorithms are based on the concept of mapping data points from a low-dimensional into a high-dimensional space to make them linearly separable; a hyperplane is then generated as the classification boundary to partition the data points SVMme . Assuming there are $n$ data points, the objective function of SVM is SVMme2 :

(12) $\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$

where $w$ is a normalization vector, $\xi_i$ are slack variables measuring the classification errors, and $C$ is the penalty parameter of the error term, which is an important hyperparameter of all SVM models.
The kernel function $k(x_i, x_j)$, which is used to measure the similarity between two data points $x_i$ and $x_j$, can be set to different types in SVM models, including several common kernel types or even customized kernels. Therefore, the kernel type is a vital hyperparameter to be tuned. Common kernel types in SVM include the linear kernel, radial basis function (RBF) kernel, polynomial kernel, and sigmoid kernel.
The different kernel functions can be denoted as follows SVMkernel :

Linear kernel:
(13) $k(x_i, x_j) = x_i^{T} x_j$

Polynomial kernel:
(14) $k(x_i, x_j) = (\gamma x_i^{T} x_j + r)^{d}$

RBF kernel:
(15) $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$

Sigmoid kernel:
(16) $k(x_i, x_j) = \tanh(\gamma x_i^{T} x_j + r)$
As shown in the kernel function equations, a few other hyperparameters need to be tuned after a kernel type is chosen. The hyperparameter $\gamma$, denoted by 'gamma' in sklearn, is a conditional hyperparameter of the 'kernel type' hyperparameter when it is set to polynomial, RBF, or sigmoid; $r$, specified by 'coef0' in sklearn, is a conditional hyperparameter of the polynomial and sigmoid kernels. Moreover, the polynomial kernel has an additional conditional hyperparameter, $d$, representing the 'degree' of the polynomial kernel function. In support vector regression (SVR) models, there is another hyperparameter, 'epsilon', indicating the distance error tolerated by its loss function sklearn .
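A brief sketch of setting these conditional hyperparameters in sklearn (the specific values are illustrative, the iris dataset is used only for demonstration, and training accuracy is printed merely to show the fitted models work):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 'gamma' applies to the RBF, polynomial, and sigmoid kernels;
# 'degree' and 'coef0' only take effect for the kernels that use them.
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X, y)
poly_svm = SVC(kernel="poly", C=1.0, gamma="scale",
               degree=3, coef0=1.0).fit(X, y)
print(rbf_svm.score(X, y), poly_svm.score(X, y))
```

For the linear kernel, 'gamma', 'degree', and 'coef0' are simply ignored, which is the conditionality an HPO method must account for.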
3.1.4 Naïve Bayes
Naïve Bayes (NB) NB1 algorithms are supervised learning algorithms based on Bayes' theorem. Assuming there are $n$ dependent features $x_1, \dots, x_n$ and a target variable $y$, the objective function of naïve Bayes can be denoted by:

(17) $\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$
where $P(y)$ is the probability of a value of $y$, and $P(x_i \mid y)$ is the posterior probability of $x_i$ given the value of $y$. Depending on the different assumptions made about the distribution of $P(x_i \mid y)$, there are different types of naïve Bayes classifiers. The four main types of NB models are: Bernoulli NB, Gaussian NB, multinomial NB, and complement NB NB2 .

In Gaussian NB, the likelihood of the features is assumed to be Gaussian:

(18) $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$

The maximum likelihood method is used to calculate the mean value, $\mu_y$, and the variance, $\sigma_y^2$. Normally, no hyperparameter needs to be tuned for Gaussian NB; the performance of a Gaussian NB model mainly depends on how well the dataset follows a Gaussian distribution.

Multinomial NB MNB is designed for multinomially-distributed data based on the naïve Bayes algorithm. Assuming there are $n$ features, $\theta_y = (\theta_{y1}, \dots, \theta_{yn})$ is the distribution of each value of the target variable $y$, where $\theta_{yi}$ equals the conditional probability $P(x_i \mid y)$ that a feature value $x_i$ appears in a data point belonging to the class $y$. Based on the concept of relative frequency counting, $\theta_{yi}$ can be estimated by a smoothed version of maximum likelihood sklearn :
(19) $\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$

where $N_{yi}$ is the number of times feature $i$ appears in a data point belonging to class $y$, and $N_y$ is the sum of all $N_{yi}$ ($i = 1, \dots, n$). The smoothing priors $\alpha \ge 0$ account for features that are not present in the learning samples. When $\alpha = 1$, it is called Laplace smoothing; when $\alpha < 1$, it is called Lidstone smoothing.
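A minimal sketch of the smoothing hyperparameter in sklearn (the count data is synthetic and the alpha values are illustrative; accuracy is printed only for demonstration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features, as multinomial NB expects.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 5))
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)

laplace = MultinomialNB(alpha=1.0).fit(X, y)    # Laplace smoothing
lidstone = MultinomialNB(alpha=0.1).fit(X, y)   # Lidstone smoothing
print(laplace.score(X, y), lidstone.score(X, y))
```

The same 'alpha' parameter name is used by sklearn's BernoulliNB and ComplementNB estimators mentioned below.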
Complement NB CNB is an improved version of the standard multinomial NB algorithm that is suitable for processing imbalanced data, while Bernoulli NB BNB requires samples to have binary-valued feature vectors so that the data follow multivariate Bernoulli distributions. Both have the additive (Laplace/Lidstone) smoothing parameter, $\alpha$, as the main hyperparameter that needs tuning. To conclude, for naïve Bayes algorithms, users often do not need to tune any hyperparameters, or only need to tune the smoothing parameter $\alpha$, which is a continuous hyperparameter.

3.1.5 Tree-based Models
Decision tree (DT) DT is a common classification method that uses a tree structure to model decisions and their possible consequences by summarizing a set of classification rules from the data. A DT has three main components: a root node representing the whole dataset; multiple decision nodes indicating decision tests and sub-node splits over each feature; and several leaf nodes representing the resulting classes n3 . DT algorithms recursively split the training set on well-chosen feature values to achieve good decisions on each subset. Pruning, which means removing some of the sub-nodes of decision nodes, is used in DT to avoid overfitting. Since a deeper tree has more sub-trees to make more accurate decisions, the maximum tree depth, 'max_depth', is an essential hyperparameter of DT algorithms IDSme .
There are many other important HPs to be tuned to build effective DT models DTHPsk
. Firstly, the quality of splits can be measured by setting a measuring function, denoted by ’criterion’ in sklearn. Gini impurity or information gain are the two main types of measuring functions. The split selection method, ’splitter’, can also be set to ’best’ to choose the best split, or ’random’ to select a random split. The number of considered features to generate the best split, ’max_features’, can also be tuned as a feature selection process. Moreover, there are several discrete hyperparameters related to the splitting process, which need to be tuned to achieve better performance: the minimum number of data points to split a decision node or to obtain a leaf node, denoted by ’min_samples_split’ and ’min_samples_leaf’, respectively; the ’max_leaf_nodes’, indicating the maximum number of leaf nodes, and the ’min_weight_fraction_leaf’ that means the minimum weighted fraction of the total weights, can also be tuned to improve model performance
sklearn DTHPsk . Based on the concept of DT models, many decision-tree-based ensemble algorithms have been proposed to improve model performance by combining multiple decision trees, including random forest (RF), extra trees (ET), and extreme gradient boosting (XGBoost) models. RF RF is an ensemble learning method that uses the bagging method to combine multiple decision trees. In RF, basic DTs are built on many randomly-generated subsets, and the class with the majority vote is selected as the final classification result RFour . ET ET is another tree-based ensemble learning method that is similar to RF, but it uses all samples to build DTs and randomly selects the feature sets. In addition, RF optimizes splits on DTs while ET makes the splits randomly. XGBoost xgboost is a popular tree-based ensemble model designed for speed and performance improvement, which uses the boosting and gradient descent methods to combine basic DTs. In XGBoost, the next input sample of a new DT is related to the results of previous DTs. XGBoost aims to minimize the following objective function IDSme :
$$Obj = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T \qquad (20)$$
where $T$ is the number of leaves in a decision tree, $G_j$ and $H_j$ are the sums of the first- and second-order gradient statistics of the cost function, and $\gamma$ and $\lambda$ are the penalty coefficients.
Since tree-based ensemble models are built with decision trees as base learners, they have the same hyperparameters as the DT models described in this subsection. Apart from these hyperparameters, RF, ET, and XGBoost all have another crucial hyperparameter to be tuned, which is the number of decision trees to be combined, denoted by ’n_estimators’ in sklearn. XGBoost has several additional hyperparameters, including XGHP : ’min_child_weight’, the minimum sum of weights in a child node; ’subsample’ and ’colsample_bytree’, which control the subsampling ratio of instances and features, respectively; and four continuous hyperparameters — ’gamma’, ’alpha’, ’lambda’, and ’learning_rate’ — indicating the minimum loss reduction required to make a split, the L1 and L2 regularization terms on weights, and the learning rate, respectively.
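A minimal sketch of the shared ’n_estimators’ hyperparameter for the sklearn tree ensembles is shown below (XGBoost is omitted since it lives in a separate package); the dataset and values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

X, y = load_iris(return_X_y=True)

# 'n_estimators' (number of combined trees) plus the shared DT
# hyperparameters such as 'max_depth'.
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=50, max_depth=5, random_state=0).fit(X, y)
n_trees = len(rf.estimators_)  # one fitted DT per estimator
```

The equivalent XGBoost hyperparameters (’gamma’, ’lambda’, ’learning_rate’, etc.) are passed to its estimator constructor in the same keyword style.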
3.1.6 Ensemble Learning Algorithms
Apart from tree-based ensemble models, there are several other general ensemble learning methods that combine multiple singular ML models to achieve better performance than any single algorithm alone. Three general ensemble learning models — voting, bagging, and AdaBoost — are introduced in this subsection Ensemble .
Voting Ensemble is a basic ensemble learning algorithm that uses the majority voting rule to combine singular estimators and generate a comprehensive estimator with improved accuracy. In sklearn, the voting method can be set to be ’hard’ or ’soft’, indicating whether to use majority voting or averaged predicted probabilities to determine the classification result. The list of selected single ML estimators and their weights can also be tuned in certain cases. For instance, a higher weight can be assigned to a betterperforming singular ML model.
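The ’voting’ and ’weights’ settings described above can be sketched as follows; the choice of base estimators and the weight values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 'voting' ('hard'/'soft') and per-estimator 'weights' are the tunable settings.
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=500)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="soft",       # average predicted probabilities
    weights=[2, 1, 1],   # higher weight for the better-performing model
)
vote.fit(X, y)
acc = vote.score(X, y)
```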
Bootstrap aggregating Ensemble , also named bagging, trains multiple base estimators on different randomly-extracted subsets to construct a final predictor bagging . When using bagging methods, the first consideration should be the type and number of base estimators in the ensemble, denoted by ’base_estimator’ and ’n_estimators’, respectively. Then, ’max_samples’ and ’max_features’, indicating the sample size and feature size used to generate the different subsets, can also be tuned.
AdaBoost Ensemble , short for adaptive boosting, is an ensemble learning method that trains multiple base learners (weak learners) consecutively, where later learners emphasize the samples misclassified by previous learners; ultimately, a final strong learner is obtained. During this process, the weights of incorrectly-classified instances are adjusted so that subsequent classifiers focus more on difficult cases, thereby gradually building a stronger classifier. In AdaBoost, the type of base estimator, ’base_estimator’, can be set to a decision tree or another method. In addition, the maximum number of estimators at which boosting is terminated, ’n_estimators’, and the learning rate that shrinks the contribution of each classifier should also be tuned to achieve a trade-off between these two hyperparameters.
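The bagging and AdaBoost hyperparameters discussed above can be sketched as follows; the values are illustrative assumptions, and both ensembles use their default base learner so the snippet stays version-agnostic.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

X, y = load_iris(return_X_y=True)

# Bagging: subset sizes are controlled by 'max_samples' and 'max_features'.
bag = BaggingClassifier(n_estimators=20, max_samples=0.8, max_features=0.8,
                        random_state=0).fit(X, y)

# AdaBoost: trade off 'n_estimators' against 'learning_rate'
# (the default base learner is a shallow decision tree).
ada = AdaBoostClassifier(n_estimators=50, learning_rate=0.5,
                         random_state=0).fit(X, y)
```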
3.1.7 Deep Learning Models
Deep learning (DL) algorithms are widely applied to various areas — like computer vision, natural language processing, and machine translation — since they have had great success solving many types of problems. DL models are based on the theory of artificial neural networks (ANNs). Common types of DL architectures include deep neural networks (DNNs), feedforward neural networks (FFNNs), deep belief networks (DBNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and many more
DL1 . All these DL models have similar hyperparameters since they are built on similar underlying neural network architectures. Compared with other ML models, DL models benefit more from HPO since they often have many hyperparameters that require tuning. The first set of hyperparameters is related to the construction of the structure of a DL model; hence, they are named model design hyperparameters. Since all neural network models have an input layer and an output layer, the complexity of a deep learning model mainly depends on the number of hidden layers and the number of neurons in each layer, which are the two main hyperparameters used to build DL models DL2 . These two hyperparameters are set and tuned according to the complexity of the dataset or the problem. DL models need enough capacity to model the objective function (or prediction task) while avoiding overfitting. At the next stage, certain function types need to be set or tuned. The first is the loss function type, which is chosen mainly based on the problem type (e.g.
, binary cross-entropy for binary classification problems, multi-class cross-entropy for multi-classification problems, and RMSE for regression problems). Another important hyperparameter is the activation function type used to model non-linear functions, which can be set to ’softmax’, ’rectified linear unit (ReLU)’, ’sigmoid’, ’tanh’, or ’softsign’. Lastly, the optimizer type can be set to stochastic gradient descent (SGD), adaptive moment estimation (Adam), root mean square propagation (RMSprop), etc.
DL3 . On the other hand, some other hyperparameters are related to the optimization and training process of DL models; hence, they are categorized as optimizer hyperparameters. The learning rate is one of the most important hyperparameters in DL models DL4 . It determines the step size at each iteration, which enables the objective function to converge. A large learning rate speeds up the learning process, but the gradient may oscillate around the local minimum value or even fail to converge. On the other hand, a small learning rate converges smoothly but largely increases model training time by requiring more training epochs. An appropriate learning rate enables the objective function to converge to a global minimum in a reasonable amount of time. Another common hyperparameter is the dropout rate. Dropout is a standard regularization method for DL models, proposed to reduce overfitting. In dropout, a proportion of neurons are randomly selected and removed, and the percentage of neurons to be removed should be tuned.
Minibatch size and the number of epochs are the other two DL hyperparameters that represent the number of processed samples before updating the model, and the number of complete passes through the entire training set, respectively DL5 . Minibatch size is affected by the resource requirements of the training process, speed, and the number of iterations. The number of epochs depends on the size of the training set and should be tuned by slowly increasing its value until validation accuracy starts to decrease, which indicates overfitting. On the other hand, DL models often converge within a few epochs, and the following epochs may lead to unnecessary additional execution time and overfitting, which can be avoided by the early stopping method. Early stopping is a form of regularization whereby model training stops in advance when validation accuracy does not increase after a certain number of consecutive epochs. The number of waiting epochs, called early stop patience, can also be tuned to reduce model training time.
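The early-stopping rule described above can be sketched framework-independently; the validation-accuracy curve below is hypothetical, and the patience value is an illustrative assumption.

```python
def early_stopping(val_accuracies, patience=3):
    """Return the epoch at which training would stop: when validation
    accuracy has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop: no improvement for `patience` epochs
    return len(val_accuracies) - 1  # budget of epochs exhausted

# Hypothetical validation curve: improves, then plateaus and degrades.
curve = [0.60, 0.72, 0.80, 0.83, 0.82, 0.81, 0.80, 0.79]
stop_epoch = early_stopping(curve, patience=3)
```

In practice, DL frameworks provide equivalent callbacks, with patience exposed as a tunable hyperparameter.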
Apart from traditional DL models, transfer learning (TL) is a technique that takes a model pre-trained on data from a related domain and transfers it to other target tasks TL . To transfer a DL model from one problem to another, a certain number of top layers are frozen, and only the remaining layers are retrained to fit the new problem. Therefore, the number of frozen layers is a vital hyperparameter to tune if TL is used.
3.2 Unsupervised Learning Algorithms
Unsupervised learning algorithms are a set of ML algorithms used to identify unknown patterns in unlabeled datasets. Clustering and dimensionality-reduction algorithms are the two main types of unsupervised learning methods. Clustering methods include k-means, DBSCAN, EM, hierarchical clustering, etc.; while PCA and LDA are two commonly-used dimensionality reduction algorithms sklearnbook .
3.2.1 Clustering Algorithms
In most clustering algorithms — including k-means, EM, and hierarchical clustering — the number of clusters is the most important hyperparameter to tune ncluster .
The k-means algorithm kmeans2 uses prototypes, indicating the centroids of clusters, to cluster data. In k-means algorithms, the number of clusters, ’n_clusters’, must be specified, and the clusters are determined by minimizing the sum of squared errors kmeans2 :
$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \qquad (21)$$
where $x$ is a sample point from the data matrix $X$ belonging to cluster $C_i$; $\mu_i$, also called the centroid of cluster $C_i$, is the mean of the samples in the cluster; and $n_i$ is the number of sample points in cluster $C_i$, so that $\mu_i = \frac{1}{n_i}\sum_{x \in C_i} x$.
To tune k-means, ’n_clusters’ is the most crucial hyperparameter. Besides this, the method for centroid initialization, ’init’, can be set to ’k-means++’, ’random’, or a human-defined array, which slightly affects model performance. In addition, ’n_init’, denoting the number of times the k-means algorithm is executed with different centroid seeds, and ’max_iter’, the maximum number of iterations in a single execution of k-means, also have slight impacts on model performance sklearn .
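The k-means hyperparameters above can be sketched as follows; the synthetic blob data makes ’n_clusters=3’ the obvious setting, and all values are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated blobs, so 'n_clusters=3' is the natural setting.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300,
            random_state=0).fit(X)
sse = km.inertia_  # the sum of squared errors minimized by k-means
```

On real data, ’n_clusters’ itself would typically be chosen by sweeping values and comparing the resulting SSE (the "elbow" heuristic) or a clustering metric.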
The expectation-maximization (EM) algorithm EM is an iterative algorithm used to detect the maximum likelihood estimates of parameters. The Gaussian mixture model is a clustering method that uses a mixture of Gaussian distributions to model data by implementing the EM method. Similar to k-means, its major hyperparameter to be tuned is ’n_components’, indicating the number of clusters or Gaussian distributions. Additionally, different methods can be chosen to constrain the covariance of the estimated classes in Gaussian mixture models, including ’full’, ’tied’, ’diagonal’, or ’spherical’ covariance
GMM . Other hyperparameters can also be tuned, including ’max_iter’ and ’tol’, representing the number of EM iterations to perform and the convergence threshold, respectively sklearn .
Hierarchical clustering HC methods build clusters by continuously merging or splitting the built-in clusters. The hierarchy of clusters is represented as a tree structure; its root indicates the unique cluster gathering all samples, and its leaves represent the clusters with only one sample HC . In sklearn, the function ’AgglomerativeClustering’ is a common type of hierarchical clustering. In agglomerative clustering, the linkage criterion, ’linkage’, determines the distance between sets of observations and can be set to ’ward’, ’complete’, ’average’, or ’single’, indicating whether to minimize the variance of the merged clusters, or to use the maximum, average, or minimum distance between every two clusters, respectively. Like other clustering methods, its main hyperparameter is the number of clusters, ’n_clusters’. However, ’n_clusters’ cannot be set if ’distance_threshold’, the linkage distance threshold above which clusters will not be merged, is set, since in that case ’n_clusters’ is determined automatically.
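The Gaussian mixture and agglomerative clustering hyperparameters above can be sketched together; the blob data and all settings are illustrative assumptions.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# GMM: 'n_components', 'covariance_type', 'max_iter', and 'tol'.
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      max_iter=100, tol=1e-3, random_state=0).fit(X)

# Agglomerative clustering: 'n_clusters' and the 'linkage' criterion.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
```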
DBSCAN DBSCAN1 is a densitybased clustering method that determines the clusters by dividing data into clusters with sufficiently high density. Unlike other clustering models, the number of clusters does not need to be configured before training. Instead, DBSCAN has two significant conditional hyperparameters — the scan radius represented by ’eps’, and the minimum number of considered neighbor points represented by ’min_samples’ — which define the cluster density together DBSCAN2 . DBSCAN works by starting with an unvisited point and detecting all its neighbor points within a predefined distance ’eps’. If the number of neighbor points reaches the value of ’min_samples’, this unvisited point and all its neighbors are defined as a cluster. The procedures are executed recursively until all data points are visited. A higher ’min_samples’ or a lower ’eps’ indicates a higher density to form a cluster.
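The interplay of ’eps’ and ’min_samples’ can be sketched as follows; the blob data and the two density hyperparameter values are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# 'eps' (scan radius) and 'min_samples' jointly define cluster density;
# the number of clusters is discovered, not configured.
db = DBSCAN(eps=1.0, min_samples=5).fit(X)

# Label -1 marks noise points, so it is excluded from the cluster count.
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

Shrinking ’eps’ or raising ’min_samples’ demands a higher density, typically yielding more noise points and smaller clusters.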
3.2.2 Dimensionality Reduction Algorithms
The increasing amount of collected data provides ample information, while increasing problem complexity. In reality, many features are irrelevant or redundant to predict target variables. Dimensionality reduction algorithms often serve as feature engineering methods to extract important features and eliminate insignificant or redundant features. Two common dimensionalityreduction algorithms are principal component analysis (PCA) and linear discriminant analysis (LDA). In PCA and LDA, the number of features to be extracted, represented by ’n_components’ in sklearn, is the main hyperparameter to be tuned.
Principal component analysis (PCA) PCA is a widely used linear dimensionality reduction method. PCA is based on the concept of mapping the original $d$-dimensional features into $k$-dimensional features ($k < d$) as the new orthogonal features, also called the principal components. PCA works by calculating the covariance matrix of the data matrix to obtain the eigenvectors of the covariance matrix. The projection matrix comprises the eigenvectors corresponding to the $k$ largest eigenvalues (i.e., the largest variance). Consequently, the data matrix can be transformed into a new space with reduced dimensionality. Singular value decomposition (SVD) SVD is a popular method used to obtain the eigenvalues and eigenvectors of the covariance matrix in PCA. Therefore, in addition to ’n_components’, the SVD solver type is another hyperparameter of PCA to be tuned, which can be assigned ’auto’, ’full’, ’arpack’, or ’randomized’ sklearn .
Linear discriminant analysis (LDA) LDA1 is another common dimensionality reduction method that projects the features onto the most discriminative directions. Unlike PCA, which takes the direction with the largest variance as the principal component, LDA optimizes the feature subspace for classification. The objective of LDA is to minimize the variance inside each class and maximize the variance between different classes after projection. Thus, the projection points within each class should be as close as possible, and the distance between the center points of different classes should be as large as possible. Similar to PCA, the number of features to be extracted, ’n_components’, should be tuned in LDA models. Additionally, the solver type of LDA can be set to ’svd’ for SVD, ’lsqr’ for a least-squares solution, or ’eigen’ for eigenvalue decomposition LDA2 . LDA also has a conditional hyperparameter, the shrinkage parameter, ’shrinkage’, which can be set to a float value along with the ’lsqr’ and ’eigen’ solvers.
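The PCA and LDA hyperparameters above, including the conditional ’shrinkage’ parameter, can be sketched as follows; the dataset and the chosen values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: 'n_components' and the SVD solver type.
pca = PCA(n_components=2, svd_solver="full")
X_pca = pca.fit_transform(X)

# LDA: 'n_components', 'solver', and (only with 'lsqr'/'eigen' solvers)
# the conditional 'shrinkage' hyperparameter.
lda = LinearDiscriminantAnalysis(n_components=2, solver="eigen", shrinkage=0.1)
X_lda = lda.fit_transform(X, y)
```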
4 Hyperparameter Optimization Techniques
4.1 Modelfree Algorithms
4.1.1 Babysitting
Babysitting, also called ’trial and error’ or grad student descent (GSD), is a basic hyperparameter tuning method ADL . This method is 100% manual tuning and is widely used by students and researchers. The workflow is simple: after building a ML model, the student tests many possible hyperparameter values based on experience, guesswork, or the analysis of previously-evaluated results; the process is repeated until the student runs out of time (often reaching a deadline) or is satisfied with the results. As such, this approach requires a sufficient amount of prior knowledge and experience to identify optimal hyperparameter values within limited time.
Manual tuning is infeasible for many problems due to several factors, like a large number of hyperparameters, complex models, timeconsuming model evaluations, and nonlinear hyperparameter interactions BBHPO . These factors inspired increased research into techniques for the automatic optimization of hyperparameters BB2 .
4.1.2 Grid Search
Grid search (GS) is one of the most commonly-used methods to explore the hyperparameter configuration space grid1 . GS can be considered an exhaustive search or a brute-force method that evaluates all the hyperparameter combinations in the given grid of configurations grid2 . GS works by evaluating the Cartesian product of a user-specified finite set of values AMLB .
GS cannot exploit the wellperforming regions further by itself. Therefore, to identify the global optimums, the following procedure needs to be performed manually SOAML :

Start with a large search space and step size.

Narrow the search space and step size based on the previous results of wellperforming hyperparameter configurations.

Repeat step 2 multiple times until an optimum is reached.
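The first, coarse pass of the procedure above can be sketched with sklearn's GridSearchCV; the dataset, estimator, and grid values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: a coarse grid over the SVM hyperparameters C and gamma.
coarse = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(), coarse, cv=3)  # evaluates all 4 x 4 combinations
search.fit(X, y)
best = search.best_params_
n_evaluated = len(search.cv_results_["params"])
```

Step 2 would then rerun the search with a narrower grid and smaller step size centered on the best-performing region.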
GS can be easily implemented and parallelized. However, the main drawback of GS is its inefficiency for high-dimensional hyperparameter configuration spaces, since the number of evaluations increases exponentially as the number of hyperparameters grows. This exponential growth is referred to as the curse of dimensionality Optunity . For GS, assuming that there are $k$ parameters and each of them has $n$ distinct values, its computational complexity increases exponentially at a rate of $O(n^k)$ PSODL . Thus, only when the hyperparameter configuration space is small can GS be an effective HPO method.
4.1.3 Random Search
To overcome certain limitations of GS, random search (RS) was proposed in RS . RS is similar to GS; but, instead of testing all values in the search space, RS randomly selects a predefined number of samples between the upper and lower bounds as candidate hyperparameter values, and then trains these candidates until the defined budget is exhausted. The theoretical basis of RS is that if the configuration space is large enough, then the global optimums, or at least their approximations, can be detected. With a limited budget, RS is able to explore a larger search space than GS RS .
The main advantage of RS is that it is easy to parallelize and to allocate resources for, since each evaluation is independent. Unlike GS, RS samples a fixed number of parameter combinations from the specified distribution, which improves system efficiency by reducing the probability of wasting much time on an unimportant small search space. Since the number of total evaluations in RS is set to a certain value $n$ before the optimization process starts, the computational complexity of RS is $O(n)$ RStime . In addition, RS can detect the global optimum or a near-global optimum when given a sufficient budget AMLB .
Although RS is more efficient than GS for large search spaces, there are still a large number of unnecessary function evaluations since it does not exploit the previously wellperforming regions SOAML .
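The fixed-evaluation-count behavior of RS can be sketched with sklearn's RandomizedSearchCV; the dataset, estimator, and sampling distributions are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# RS draws a fixed number of samples (n_iter) from the given distributions,
# so total cost is known in advance regardless of search-space size.
dist = {"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e1)}
search = RandomizedSearchCV(SVC(), dist, n_iter=20, cv=3, random_state=0)
search.fit(X, y)
n_evaluated = len(search.cv_results_["params"])
```

Note that, unlike GS, continuous hyperparameters are sampled from full distributions rather than a fixed list of values.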
To conclude, the main limitation of both RS and GS is that every evaluation in their iterations is independent of previous evaluations; thus, they waste massive time evaluating poorlyperforming areas of the search space. This issue can be solved by other optimization methods, like Bayesian optimization that uses previous evaluation records to determine the next evaluation BOHP .
4.2 Gradientbased Optimization
Gradient descent GBO is a traditional optimization technique that calculates the gradient of variables to identify a promising direction and moves towards the optimum. After randomly selecting a point, the technique moves in the direction opposite to the largest gradient to locate the next point. Therefore, a local optimum can be reached after convergence. For convex functions, the local optimum is also the global optimum. Gradient-based algorithms have a time complexity of $O(n^k)$ for optimizing hyperparameters GBOtime .
For specific machine learning algorithms, the gradient of certain hyperparameters can be calculated, and then gradient descent can be used to optimize these hyperparameters. Although gradientbased algorithms have a faster convergence speed to reach local optimum than the previouslypresented methods in Section 4.1, they have several limitations. Firstly, they can only be used to optimize continuous hyperparameters because other types of hyperparameters, like categorical hyperparameters, do not have gradient directions. Secondly, they are only efficient for convex functions because the local instead of a global optimum may be reached for nonconvex functions SOAML . Therefore, the gradientbased algorithms can only be used in some cases where it is possible to obtain the gradient of hyperparameters; e.g., the learning rate in neural networks (NN) GBad . Still, it is not guaranteed for these ML algorithms to identify global optimums using gradientbased optimization techniques.
4.3 Bayesian Optimization
Bayesian optimization (BO) BO1 is an iterative algorithm that is popularly used for HPO problems. Unlike GS and RS, BO determines the future evaluation points based on the previouslyobtained results. To determine the next hyperparameter configuration, BO uses two key components: a surrogate model and an acquisition function RF . The surrogate model aims to fit all the currentlyobserved points into the objective function. After obtaining the predictive distribution of the probabilistic surrogate model, the acquisition function determines the usage of different points by balancing the tradeoff between exploration and exploitation. Exploration is to sample the instances in the areas that have not been sampled, while exploitation is to sample in the current regions where the global optimum is most likely to occur, based on the posterior distribution. BO models balance the exploration and the exploitation processes to detect the current most likely optimal regions and avoid missing better configurations in the unexplored areas BO2 .
The basic procedures of BO are as follows BO1 :

Build a probabilistic surrogate model of the objective function.

Detect the optimal hyperparameter values on the surrogate model.

Apply these hyperparameter values to the real objective function to evaluate them.

Update the surrogate model with new results.

Repeat steps 2  4 until the maximum number of iterations is reached.
Thus, BO works by updating the surrogate model after each evaluation on the objective function. BO is more efficient than GS and RS since it can detect the optimal hyperparameter combinations by analyzing the previouslytested values, and running a surrogate model is often much cheaper than running a real objective function.
However, since Bayesian optimization models are executed based on the previouslytested values, they belong to sequential methods that are difficult to parallelize; but they can usually detect nearoptimal hyperparameter combinations within a few iterations EHPO .
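The iterative BO loop described above can be sketched on a toy 1-D problem using a GP surrogate and the expected-improvement acquisition function; the objective function, kernel choice, and candidate grid are illustrative assumptions standing in for an expensive model evaluation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy 1-D objective standing in for an expensive model evaluation.
def objective(x):
    return -(x - 2.0) ** 2  # maximum at x = 2

rng = np.random.RandomState(0)
X_obs = rng.uniform(0, 5, size=(3, 1))  # initial random evaluations
y_obs = objective(X_obs).ravel()
candidates = np.linspace(0, 5, 200).reshape(-1, 1)

for _ in range(10):
    # 1) Fit the probabilistic surrogate to all observed points.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True)
    gp.fit(X_obs, y_obs)
    # 2) Expected improvement balances exploration and exploitation.
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y_obs.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    # 3) Evaluate the most promising point on the real objective.
    x_next = candidates[np.argmax(ei)]
    # 4) Update the surrogate's training data with the new result.
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next))

x_best = X_obs[np.argmax(y_obs), 0]
```

Production libraries replace the fixed candidate grid with proper acquisition-function optimization, but the fit/acquire/evaluate/update cycle is the same.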
Common surrogate models for BO include the Gaussian process (GP) GP , random forest (RF) SMAC , and the tree Parzen estimator (TPE) AHPO . Therefore, there are three main types of BO algorithms based on their surrogate models: BO-GP, BO-RF, and BO-TPE. An alternative name for BO-RF is sequential model-based algorithm configuration (SMAC) SMAC .
4.3.1 BO-GP
Gaussian process (GP) is a standard surrogate model for objective function modeling in BO BO1 . Assuming that the function $f$, with mean $\mu$ and covariance $\sigma^2$, is a realization of a GP, the predictions follow a normal distribution BOs :
$$p(y \mid x, D) = N(y \mid \hat{y}, \hat{\sigma}^2) \qquad (22)$$
where $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is the set of previously-observed points drawn from the hyperparameter configuration space, $x$ is a hyperparameter configuration, and $y$ is the evaluation result of that configuration. After obtaining a set of predictions, the points to be evaluated next are selected from the confidence intervals generated by the BO-GP model. Each newly-tested data point is added to the sample records, and the BO-GP model is rebuilt with the new information. This procedure is repeated until termination.
4.3.2 SMAC
Random forest (RF) is another popular surrogate function for BO to model the objective function using an ensemble of regression trees. BO using RF as the surrogate model is also called SMAC SMAC .
Assuming that there is a Gaussian model $N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and variance of the regression function $r(x)$, respectively, then SMAC :
$$\mu = \frac{1}{|B|}\sum_{r \in B} r(x) \qquad (23)$$
$$\sigma^2 = \frac{1}{|B| - 1}\sum_{r \in B} \left(r(x) - \mu\right)^2 \qquad (24)$$
where $B$ is the set of regression trees in the forest. The major procedures of SMAC are as follows AMLSC :

RF starts with building $|B|$ regression trees, each constructed by bootstrap-sampling instances from the training set with replacement.

A split node is selected from a random subset of the hyperparameters for each tree.

To maintain a low computational cost, both the minimum number of instances considered for further split and the number of trees to grow are set to a certain value.

Finally, the mean and variance for each new configuration are estimated by RF.
Compared with BO-GP, the main advantage of SMAC is its support for all types of variables, including continuous, discrete, categorical, and conditional hyperparameters BOs . The time complexities of using SMAC to fit the model and to predict variances are $O(n\log n)$ and $O(\log n)$, respectively, which are much lower than the complexities of BO-GP AMLSC .
4.3.3 BO-TPE
Tree-structured Parzen estimator (TPE) AHPO is another common surrogate model for BO. Instead of defining a predictive distribution as in BO-GP, BO-TPE creates two density functions, $l(x)$ and $g(x)$, to act as the generative models for all domain variables AMLSC . To apply TPE, the observation results are divided into good results and poor results by a predefined percentile $y^*$, and the two sets of results are modeled by simple Parzen windows AHPO :
$$p(x \mid y, D) = \begin{cases} l(x), & \text{if } y < y^* \\ g(x), & \text{if } y \ge y^* \end{cases} \qquad (25)$$
After that, the expected improvement in the acquisition function is reflected by the ratio between the two density functions, $l(x)/g(x)$, which is used to determine the new configurations for evaluation. The Parzen estimators are organized in a tree structure, so the specified conditional dependencies are retained; therefore, TPE naturally supports conditional hyperparameters BOs . The time complexity of BO-TPE is $O(n\log n)$, which is lower than the complexity of BO-GP AMLSC .
BO methods are effective for many HPO problems, even if the objective function is stochastic, non-convex, or non-continuous. However, the main drawback of BO models is that, if they fail to achieve a balance between exploration and exploitation, they might reach only a local rather than the global optimum. RS does not have this limitation, since it does not focus on any specific area. Additionally, it is difficult to parallelize BO models, since their intermediate results depend on each other EHPO .
4.4 Multifidelity Optimization Algorithms
One major issue with HPO is the long execution time, which increases with a larger number of hyperparameter values and larger datasets. The execution time may be several hours, several days, or even longer HPS . Multi-fidelity optimization techniques are common approaches to resolving the constraints of limited time and resources. To save time, a subset of the original dataset or a subset of the features can be used subset . Multi-fidelity optimization combines low-fidelity and high-fidelity evaluations for practical applications multifidelity . In low-fidelity evaluations, a relatively small subset is evaluated at a low cost but with poor generalization performance. In high-fidelity evaluations, a relatively large subset is evaluated with better generalization performance but at a higher cost. In multi-fidelity optimization algorithms, poorly-performing configurations are discarded after each round of hyperparameter evaluation on the generated subsets, and only well-performing hyperparameter configurations are evaluated on the entire training set.
Bandit-based algorithms, categorized as multi-fidelity optimization algorithms, have shown success in dealing with deep learning optimization problems AMLSC . Two common bandit-based techniques are successive halving SH and Hyperband Hyperband .
4.4.1 Successive Halving
Theoretically speaking, exhaustive methods are able to identify the best hyperparameter combination by evaluating all the given combinations. However, many practical factors, including limited time and resources, must be considered. These constraints are called budgets ($B$). To overcome the limitations of GS and RS and to improve efficiency, successive halving algorithms were proposed in SH .
The main process of using successive halving algorithms for HPO is as follows. Firstly, it is presumed that there are $n$ sets of hyperparameter combinations, and that they are evaluated with uniformly-allocated budgets ($b = B/n$). Then, according to the evaluation results of each iteration, half of the poorly-performing hyperparameter configurations are eliminated, and the better-performing half is passed to the next iteration with doubled per-configuration budgets. The above process is repeated until the final optimal hyperparameter combination is detected.
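The halving loop above can be sketched in a few lines; the scoring function below is a hypothetical stand-in for model training, in which scores improve with budget and configurations nearer an assumed optimum of 0.5 score higher.

```python
import random

def evaluate(config, budget):
    """Hypothetical stand-in for model training: score improves with
    budget, and configs closer to the assumed optimum 0.5 score higher."""
    return -abs(config - 0.5) + 0.001 * budget

random.seed(0)
configs = [random.random() for _ in range(16)]  # n = 16 candidates
budget = 1

# Halve the candidate set and double the per-config budget each round.
while len(configs) > 1:
    scores = {c: evaluate(c, budget) for c in configs}
    configs = sorted(configs, key=scores.get, reverse=True)[: len(configs) // 2]
    budget *= 2

winner = configs[0]
```

Four rounds reduce 16 candidates to 1, with the surviving configurations receiving progressively larger budgets.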
Successive halving is more efficient than RS, but is affected by the tradeoff between the number of hyperparameter configurations and the budgets allocated to each configuration AMLB . Thus, the main concern of successive halving is how to allocate the budget and how to determine whether to test fewer configurations with a higher budget for each or to test more configurations with a lower budget for each SOAML .
4.4.2 Hyperband
Hyperband Hyperband was then proposed to solve the dilemma of successive halving algorithms by dynamically choosing a reasonable number of configurations. It aims to achieve a trade-off between the number of hyperparameter configurations ($n$) and their allocated budgets by dividing the total budget ($B$) into $n$ pieces and allocating one piece to each configuration ($b = B/n$). Successive halving serves as a subroutine on each set of random configurations to eliminate the poorly-performing hyperparameter configurations and improve efficiency. The main steps of Hyperband algorithms are shown in Algorithm 1 SOAML .
Firstly, the budget constraints $b_{min}$ and $b_{max}$ are determined by the total number of data points, the minimum number of instances required to train a sensible model, and the available budgets. After that, the number of configurations $n$ and the budget size allocated to each configuration are calculated based on $b_{min}$ and $b_{max}$ in steps 2-3 of Algorithm 1. The $n$ configurations are sampled and then passed to the successive halving subroutine demonstrated in steps 4-5. The successive halving algorithm discards the identified poorly-performing configurations and passes the well-performing configurations on to the next iteration. This process is repeated until the final optimal hyperparameter configuration is identified. By involving the successive halving search method, Hyperband has a computational complexity of $O(n\log n)$ Hyperband .
4.4.3 BOHB
Bayesian Optimization HyperBand (BOHB) BOHB
is a state-of-the-art HPO technique that combines Bayesian optimization and Hyperband to incorporate the advantages of both while avoiding their drawbacks. The original Hyperband uses random search to explore the hyperparameter configuration space, which has low efficiency. BOHB replaces the RS method with BO to achieve both high performance and low execution time, effectively using parallel resources to optimize all types of hyperparameters. In BOHB, TPE is the standard surrogate model for BO, but it uses multidimensional kernel density estimators. Therefore, the complexity of BOHB is also $O(n \log n)$ BOHB . It has been shown that BOHB outperforms many other optimization techniques when tuning SVM and DL models BOHB . The only limitation of BOHB is that it requires evaluations on subsets with small budgets to be representative of evaluations on the entire training set; otherwise, BOHB may have a slower convergence speed than standard BO models.
4.5 Metaheuristic Algorithms
Metaheuristic algorithms Metaheuristic are a set of algorithms mainly inspired by biological theories and widely used for optimization problems. Unlike many traditional optimization methods, metaheuristics have the capacity to solve non-convex, non-continuous, or non-smooth optimization problems.
Population-based optimization algorithms (POAs) are a major type of metaheuristic algorithm, including genetic algorithms (GAs), evolutionary algorithms, evolutionary strategies, and particle swarm optimization (PSO). POAs start by creating and updating a population in each generation; every individual in each generation is then evaluated until the global optimum is identified BOHP . The main differences between different POAs are the methods used to generate and select populations HumanAML . POAs can be easily parallelized, since a population of individuals can be evaluated on multiple threads or machines in parallel AMLB . Genetic algorithms and particle swarm optimization are the two POAs most popularly used for HPO problems.

4.5.1 Genetic Algorithm
Genetic algorithm (GA) GA2 is one of the most common metaheuristic algorithms, based on the evolutionary theory that individuals with the best survival capability and adaptability to the environment are more likely to survive and pass on their capabilities to future generations. The next generation inherits its parents' characteristics and may include both better and worse individuals. Better individuals are more likely to survive and produce more capable offspring, while worse individuals gradually disappear. After several generations, the individual with the best adaptability is identified as the global optimum GA3 .
To apply GA to HPO problems, each chromosome or individual represents a hyperparameter, and its decimal value is the actual input value of the hyperparameter in each evaluation. Every chromosome has several genes, which are binary digits; crossover and mutation operations are then performed on the genes of this chromosome. The population comprises all possible values within the initialized chromosome/parameter ranges, while the fitness function characterizes the evaluation metrics of the parameters GA3 . Since the randomly-initialized parameter values often do not include the optimal parameter values, several operations on the well-performing chromosomes, including selection, crossover, and mutation operations, must be performed to identify the optimums GA2 . Chromosome selection is implemented by selecting the chromosomes with good fitness function values. To keep the population size unchanged, the chromosomes with good fitness function values are passed to the next generation with higher probability, where they generate new chromosomes with the parents' best characteristics. Chromosome selection ensures that good characteristics of each generation are passed to later generations. Crossover generates new chromosomes by exchanging a proportion of genes between different chromosomes. Mutation operations also generate new chromosomes by randomly altering one or more genes of a chromosome. Crossover and mutation operations enable later generations to have different characteristics and reduce the chance of missing good characteristics AMLSC .
The main procedures of GA are as follows Metaheuristic :

1. Randomly initialize the population, chromosomes, and genes, representing the entire search space, the hyperparameters, and the hyperparameter values, respectively.
2. Evaluate the performance of each individual in the current generation by calculating its fitness function, which indicates the objective function of the ML model.
3. Perform selection, crossover, and mutation operations on the chromosomes to produce a new generation containing the next hyperparameter configurations to be evaluated.
4. Repeat steps 2 and 3 until the termination condition is met.
5. Terminate and output the optimal hyperparameter configuration.
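The steps above can be condensed into a minimal, self-contained sketch for tuning a single integer hyperparameter; the 5-gene binary encoding, population size, rates, and the toy fitness function are illustrative choices, not prescribed by GA itself:

```python
import random

random.seed(123)

def fitness(x):
    # Stand-in for a model's validation score as a function of one
    # hyperparameter x in [0, 31]; the optimum is at x = 20.
    return -(x - 20) ** 2

def decode(chrom):
    # 5 binary genes -> integer hyperparameter value in [0, 31].
    return int("".join(map(str, chrom)), 2)

def evolve(pop_size=20, genes=5, generations=30, mutation_rate=0.1):
    # Step 1: random initialization of the population.
    pop = [[random.randint(0, 1) for _ in range(genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: evaluate fitness; selection keeps the better half.
        pop.sort(key=lambda c: fitness(decode(c)), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randint(1, genes - 1)
            child = a[:cut] + b[cut:]            # Step 3: crossover
            for i in range(genes):               # Step 3: mutation
                if random.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children                 # Step 4: next generation
    # Step 5: output the best configuration found.
    return decode(max(pop, key=lambda c: fitness(decode(c))))

print(evolve())  # expected to converge near 20
```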
Among the above steps, the population initialization step is an important step of GA and PSO since it provides an initial guess of the optimal values. Although the initialized values will be iteratively improved in the optimization process, a suitable population initialization method can significantly improve the convergence speed and performance of POAs. A good initial population of hyperparameters should involve individuals that are close to global optimums by covering the promising regions and should not be localized to an unpromising region of the search space goodini .
To generate hyperparameter configuration candidates for the initial population, random initialization that simply creates the initial population with random values in the given search space is often used in GA ini1 . Thus, GA is easily implemented and does not necessitate good initializations, because its selection, crossover, and mutation operations lower the possibility of missing the global optimum.
Hence, it is useful when the data analyst does not have much experience determining a potentially appropriate initial search space for the hyperparameters. The main limitation of GA is that the algorithm itself introduces additional hyperparameters to be configured, including the fitness function type, population size, crossover rate, and mutation rate. Moreover, GA is a sequential execution algorithm, making it difficult to parallelize. The time complexity of GA is $O(n^2)$ GAtime . As a result, GA may sometimes be inefficient due to its low convergence speed.
4.5.2 Particle Swarm Optimization
Particle swarm optimization (PSO) PSO1 is another set of evolutionary algorithms that are commonly used for optimization problems. PSO algorithms are inspired by biological populations that exhibit both individual and social behaviors HumanAML . PSO works by enabling a group of particles (swarm) to traverse the search space in a semirandom manner BBHPO . PSO algorithms identify the optimal solution through cooperation and information sharing among individual particles in a group.
In PSO, there is a group of $n$ particles in a swarm SOAML :

(26)  $S = \{P_1, P_2, \ldots, P_n\}$,

and each particle is represented by a vector:

(27)  $P_i = \langle \vec{x}_i, \vec{v}_i, \vec{p}_i \rangle$,

where $\vec{x}_i$ is the current position, $\vec{v}_i$ is the current velocity, and $\vec{p}_i$ is the best position of particle $i$ known so far.

PSO initially generates each particle with a random position and a random velocity. Every particle evaluates its current position and records it with its performance score. In the next iteration, the velocity of each particle is changed based on its best-known position $\vec{p}_i$ and the current global optimal position $\vec{p}$:

(28)  $\vec{v}_i := \vec{v}_i + U(0, \varphi_1)(\vec{p}_i - \vec{x}_i) + U(0, \varphi_2)(\vec{p} - \vec{x}_i)$,

where $U(0, \varphi)$ is the continuous uniform distribution on $[0, \varphi]$, parameterized by the acceleration constants $\varphi_1$ and $\varphi_2$. After that, the particles move according to their new velocity vectors:

(29)  $\vec{x}_i := \vec{x}_i + \vec{v}_i$.
The above procedures are repeated until convergence or termination constraints are reached.
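A minimal one-dimensional sketch of this loop is shown below; an inertia weight of 0.7 is added to the velocity update, a common stabilizing variant not shown in Eqs. (28) and (29), and the objective is a toy stand-in for a model's validation loss:

```python
import random

random.seed(7)

def pso(objective, bounds=(-10.0, 10.0), n_particles=15, iterations=60,
        phi1=1.5, phi2=1.5):
    """Minimal 1-D PSO sketch: velocities are nudged toward each
    particle's best-known position and the swarm's global best."""
    lo, hi = bounds
    x = [random.uniform(lo, hi) for _ in range(n_particles)]   # positions
    v = [random.uniform(-1, 1) for _ in range(n_particles)]    # velocities
    pbest = x[:]                                               # personal bests
    gbest = min(x, key=objective)                              # global best
    for _ in range(iterations):
        for i in range(n_particles):
            # Velocity update toward personal and global bests,
            # damped by an inertia weight of 0.7 for stability.
            v[i] = (0.7 * v[i]
                    + random.uniform(0, phi1) * (pbest[i] - x[i])
                    + random.uniform(0, phi2) * (gbest - x[i]))
            x[i] += v[i]                                       # position update
            if objective(x[i]) < objective(pbest[i]):
                pbest[i] = x[i]
                if objective(x[i]) < objective(gbest):
                    gbest = x[i]
    return gbest

# Stand-in objective: validation loss as a function of one hyperparameter.
best = pso(lambda x: (x - 3.0) ** 2)
print(round(best, 2))  # close to 3.0
```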
Compared with GA, PSO is easier to implement, since it does not have additional operations like crossover and mutation. In GA, all chromosomes share information with each other, so the entire population moves uniformly toward the optimal region; in PSO, only information on the individual best particle and the global best particle is transmitted to the others, a one-way flow of information sharing, so the entire search process follows the direction of the current optimal solution SOAML . The computational complexity of the PSO algorithm is $O(n \log n)$ PSOtime . In most cases, the convergence speed of PSO is faster than that of GA. In addition, particles in PSO operate independently and only need to share information after each iteration, so the process is easily parallelized to improve model efficiency BBHPO .
The main limitation of PSO is that it requires proper population initialization; otherwise, it might only reach a local instead of a global optimum, especially for discrete hyperparameters PSOdiscrete . Proper population initialization requires the developers' prior experience or the use of population initialization techniques. Many such techniques have been proposed to improve the performance of evolutionary algorithms, like the opposition-based optimization algorithm ini1 and the space transformation search method ini2 . However, involving additional population initialization techniques requires more execution time and resources.
5 Applying Optimization Techniques to Machine Learning Algorithms
5.1 Optimization Techniques Analysis
Grid search (GS) is a simple method, but its major limitation is that it is time-consuming and impacted by the curse of dimensionality Optunity . Thus, it is unsuitable for a large number of hyperparameters. Moreover, GS is often unable to detect the global optimum of continuous parameters, since it requires a predefined, finite set of hyperparameter values. It is also unrealistic to use GS to identify integer and continuous hyperparameter optimums with limited time and resources. Therefore, compared with other techniques, GS is only efficient for a small number of categorical hyperparameters.
Random search is more efficient than GS and supports all types of hyperparameters. In practical applications, using RS to evaluate randomly-selected hyperparameter values helps analysts explore a large search space. However, since RS does not consider previously-tested results, it may involve many unnecessary evaluations, which decrease its efficiency RS .
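A bare-bones random search over a continuous configuration space can be written as follows; the search space and the toy objective are illustrative stand-ins for real hyperparameters and a real validation loss:

```python
import random

random.seed(42)

def random_search(objective, space, n_trials=100):
    """Random search sketch: sample each hyperparameter independently
    from its range and keep the best-scoring configuration. Earlier
    results never influence later samples."""
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: random.uniform(lo, hi)
               for name, (lo, hi) in space.items()}
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy objective standing in for a validation loss, minimized at
# learning_rate = 0.01 and dropout = 0.2.
space = {"learning_rate": (1e-4, 1e-1), "dropout": (0.0, 0.5)}
cfg, loss = random_search(
    lambda c: (c["learning_rate"] - 0.01) ** 2 + (c["dropout"] - 0.2) ** 2,
    space)
print(cfg, loss)
```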
Hyperband can be considered an improved version of RS, and both support parallel execution. Hyperband balances model performance and resource usage, so it is more efficient than RS, especially with limited time and resources surrogates . However, GS, RS, and Hyperband all share a major constraint: they treat each hyperparameter independently and do not consider hyperparameter correlations HBBO . Thus, they are inefficient for ML algorithms with conditional hyperparameters, like SVM, DBSCAN, and logistic regression.
Gradient-based algorithms are not a prevalent choice for hyperparameter optimization, since they only support continuous hyperparameters and can only detect a local rather than the global optimum for non-convex HPO problems SOAML . Therefore, gradient-based algorithms can only be used to optimize certain hyperparameters, like the learning rate in DL models.
Bayesian optimization models are divided into three different models, BO-GP, SMAC, and BO-TPE, based on their surrogate models. BO algorithms determine the next hyperparameter value based on previously-evaluated results to reduce unnecessary evaluations and improve efficiency. BO-GP mainly supports continuous and discrete hyperparameters (by rounding them), but does not support conditional hyperparameters BOHP ; SMAC and BO-TPE are both able to handle categorical, discrete, continuous, and conditional hyperparameters. SMAC performs better when there are many categorical and conditional parameters, or when cross-validation is used, while BO-GP performs better for only a few continuous parameters surrogates . BO-TPE preserves the specified conditional relationships, so one advantage of BO-TPE over BO-GP is its innate support for specified conditional hyperparameters BOHP .
Metaheuristic algorithms, including GA and PSO, are more complicated than many other HPO algorithms, but often perform well for complex optimization problems. They support all types of hyperparameters and are particularly efficient for large configuration spaces, since they can obtain near-optimal solutions within even very few iterations. However, GA and PSO have their own advantages and disadvantages in practical use. PSO is able to support large-scale parallelization, and is particularly suitable for continuous and conditional HPO problems PSODL ; GA, on the other hand, is executed sequentially, making it difficult to parallelize. Therefore, PSO often executes faster than GA, especially for large configuration spaces and large datasets. However, appropriate population initialization is crucial for PSO; otherwise, it may converge slowly or only identify a local instead of a global optimum. The impact of proper population initialization is not as significant for GA as for PSO PSO3 . Another limitation of GA is that it introduces additional hyperparameters, like its crossover and mutation rates GA2 .
The strengths and limitations of the hyperparameter optimization algorithms involved in this paper are summarized in Table 1.


HPO Method  Strengths  Limitations  Time Complexity

GS  · Simple.  · Time-consuming. · Only efficient with categorical HPs.  O(n^k)
RS  · More efficient than GS. · Enable parallelization.  · Does not consider previous results. · Not efficient with conditional HPs.  O(n)
Gradient-based models  · Fast convergence speed for continuous HPs.  · Only support continuous HPs. · May only detect local optimums.  O(n^k)
BO-GP  · Fast convergence speed for continuous HPs.  · Poor capacity for parallelization. · Not efficient with conditional HPs.  O(n^3)
SMAC  · Efficient with all types of HPs.  · Poor capacity for parallelization.  O(n log n)
BO-TPE  · Efficient with all types of HPs. · Keep conditional dependencies.  · Poor capacity for parallelization.  O(n log n)
Hyperband  · Enable parallelization.  · Not efficient with conditional HPs. · Require subsets with small budgets to be representative.  O(n log n)
BOHB  · Efficient with all types of HPs. · Enable parallelization.  · Require subsets with small budgets to be representative.  O(n log n)
GA  · Efficient with all types of HPs. · Does not require good initialization.  · Poor capacity for parallelization.  O(n^2)
PSO  · Efficient with all types of HPs. · Enable parallelization.  · Require proper initialization.  O(n log n)
5.2 Applying HPO Algorithms to ML Models
Since there are many different HPO methods for different use cases, it is crucial to select the appropriate optimization techniques for different ML models.
Firstly, suppose we have access to multiple fidelities, meaning that meaningful budgets can be defined: the performance rankings of hyperparameter configurations evaluated on small budgets should be the same as or similar to the configuration rankings on the full budget (the original dataset). In this case, BOHB would be the best choice, since it has the advantages of both BO and Hyperband AMLB BOHB .
On the other hand, if multiple fidelities are not applicable, meaning that subsets of the original dataset or of the original features are too misleading or noisy to reflect the performance on the entire dataset, BOHB may perform poorly and have a higher time complexity than standard BO models; in that case, choosing other HPO algorithms would be more efficient BOHB .
ML algorithms can be classified by the characteristics of their hyperparameter configurations. Appropriate optimization algorithms can be chosen to optimize the hyperparameters based on these characteristics.
5.2.1 One Discrete Hyperparameter
For some common ML algorithms, like certain neighbor-based, clustering, and dimensionality reduction algorithms, only one discrete hyperparameter needs to be tuned. For KNN, the major hyperparameter is $k$, the number of considered neighbors. The most essential hyperparameter of k-means, hierarchical clustering, and EM is the number of clusters. Similarly, for dimensionality reduction algorithms, including PCA and LDA, the basic hyperparameter is ’n_components’, the number of features to be extracted.
In these situations, Bayesian optimization is the best choice, and the three surrogates can be tested to find the best one. Hyperband is another good choice, which may have a fast execution speed due to its capacity for parallelization. In some cases, one may want to fine-tune the ML model by considering other, less important hyperparameters, like the distance metric of KNN and the SVD solver type of PCA; BO-TPE, GA, or PSO can be chosen for these situations.
5.2.2 One Continuous Hyperparameter
Some linear models, including the ridge and lasso algorithms, and some naïve Bayes algorithms, including multinomial NB, Bernoulli NB, and complement NB, generally have only one vital continuous hyperparameter to be tuned. In ridge and lasso, the continuous hyperparameter is ’alpha’, the regularization strength. In the three NB algorithms mentioned above, the critical hyperparameter is also named ’alpha’, but it represents the additive (Laplace/Lidstone) smoothing parameter. For these ML algorithms, BO-GP is the best choice, since it is good at optimizing a small number of continuous hyperparameters. Gradient-based algorithms can also be used, but they might only detect local optimums, so they are less effective than BO-GP.
5.2.3 A Few Conditional Hyperparameters
It is noticeable that many ML algorithms have conditional hyperparameters, like SVM, LR, and DBSCAN. LR has three correlated hyperparameters, ’penalty’, ’C’, and the solver type. Similarly, DBSCAN has ’eps’ and ’min_samples’, which must be tuned in conjunction. SVM is more complex, since after setting a different kernel type, a separate set of conditional hyperparameters needs to be tuned next, as described in Section 3.1.3. Hence, HPO methods that cannot effectively optimize conditional hyperparameters, including GS, RS, BO-GP, and Hyperband, are not suitable for ML models with conditional hyperparameters. For these ML methods, BO-TPE is the best choice if we have predefined relationships among the hyperparameters. SMAC is also a good choice, since it also performs well for tuning conditional hyperparameters. GA and PSO can be used as well.
5.2.4 A Large Hyperparameter Configuration Space with Multiple Types of Hyperparameters
In ML, tree-based algorithms, including DT, RF, ET, and XGBoost, as well as DL algorithms, like DNNs, CNNs, and RNNs, are the most complex types of ML algorithms to be tuned, since they have many hyperparameters of various types. For these ML models, PSO is the best choice, since it enables parallel executions to improve efficiency, particularly for DL models that often require massive training time. Some other techniques, like GA, BO-TPE, and SMAC, can also be used, but they may require more time than PSO, since it is difficult to parallelize them.
5.2.5 Categorical Hyperparameters
This category of hyperparameters mainly concerns ensemble learning algorithms, since their major hyperparameter is a categorical one. For bagging and AdaBoost, the categorical hyperparameter is ’base_estimator’, which is set to a single ML model. For voting, it is ’estimators’, indicating the list of single ML models to be combined. The voting method has another categorical hyperparameter, ’voting’, which chooses between hard and soft voting. If we only consider these categorical hyperparameters, GS would be sufficient to test suitable base machine learners. On the other hand, in many cases, other hyperparameters need to be considered, like ’n_estimators’, ’max_samples’, and ’max_features’ in bagging, as well as ’n_estimators’ and ’learning_rate’ in AdaBoost; consequently, BO algorithms would be a better choice to optimize these continuous or discrete hyperparameters.
In conclusion, when tuning a ML model to achieve high model performance and low computational costs, the most suitable HPO algorithm should be selected based on the properties of its hyperparameters.
6 Existing HPO Frameworks
To tackle HPO problems, many open-source libraries exist to put theory into practice and lower the threshold for ML developers. In this section, we provide a brief introduction to some popular open-source HPO libraries and frameworks, mainly for Python programming. The principles behind the involved optimization algorithms are provided in Section 4.
6.1 Sklearn
In sklearn sklearn , ’GridSearchCV’ can be implemented to detect the optimal hyperparameters using the GS algorithm. Each hyperparameter combination in the human-defined configuration space is evaluated by the program, with its performance measured using cross-validation. When all the instances in the configuration space have been evaluated, the optimal hyperparameter combination in the defined search space, along with its performance score, is returned.
’RandomizedSearchCV’ is also provided in sklearn to implement a RS method. It evaluates a predefined number of randomly-selected hyperparameter values in parallel. Cross-validation is conducted to effectively evaluate the performance of each configuration.
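As a rough usage sketch, both classes can be applied to a KNN classifier; the dataset and the hyperparameter range below are chosen purely for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Exhaustive grid search over a small discrete space, with 3-fold CV.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 21))}, cv=3)
grid.fit(X, y)

# Random search: 10 values sampled from the same range.
rand = RandomizedSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": randint(1, 21)},
                          n_iter=10, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```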
6.2 Spearmint
Spearmint BO1 is a library that uses Bayesian optimization with a Gaussian process as the surrogate model. Spearmint’s primary deficiency is that it is not very efficient for categorical and conditional hyperparameters.
6.3 BayesOpt
Bayesian Optimization (BayesOpt) BayesOpt is a Python library employed to solve HPO problems using BO. BayesOpt uses a Gaussian process as its surrogate model to calculate the objective function based on past evaluations and utilizes an acquisition function to determine the next values.
6.4 Hyperopt
Hyperopt Hyperopt is a HPO framework that includes RS and BO-TPE as optimization algorithms. Unlike some other libraries that only support a single model, Hyperopt is able to use multiple instances to model hierarchical hyperparameters. In addition, Hyperopt is parallelizable, since it uses MongoDB as the central database to store the hyperparameter combinations. hyperopt-sklearn hyperoptsklearn and hyperas hyperas are two libraries that apply Hyperopt to the scikit-learn and Keras libraries, respectively.
6.5 SMAC
6.6 BOHB
The BOHB framework BOHB is a combination of Bayesian optimization and Hyperband surrogates . It overcomes one limitation of Hyperband, namely that it randomly generates the test configurations, by replacing this procedure with BO. TPE is used as the surrogate model to store and model function evaluations. Using BOHB to evaluate the instances can achieve a trade-off between model performance and the current budget.
6.7 Optunity
Optunity Optunity is a popular HPO framework that provides several optimization techniques, including GS, RS, PSO, and BO-TPE. In Optunity, categorical hyperparameters are converted to discrete hyperparameters by indexing, and discrete hyperparameters are processed as continuous hyperparameters by rounding them; as such, it supports all kinds of hyperparameters.
6.8 Skopt
6.9 GPflowOpt
GPflowOpt GPflowOpt
is a Python library for BO using GP as the surrogate model. It supports running BO-GP on a GPU using the TensorFlow library. Therefore, GPflowOpt is a good choice if BO is used in deep learning models and GPU resources are available.
6.10 Talos
Talos talos is a Python package designed for hyperparameter optimization with Keras models. Talos can be fully deployed into any Keras models and implemented easily without learning any new syntax. Several optimization techniques, including GS, RS, and probabilistic reduction, can be implemented using Talos.
6.11 Sherpa
Sherpa SHERPA is a Python package used for HPO problems. It can be used with other ML libraries, including sklearn sklearn , TensorFlow tf , and Keras keras . It supports parallel computations and has several optimization methods, including GS, RS, BO-GP (via GPyOpt), Hyperband, and population-based training (PBT).
6.12 Osprey
Osprey Osprey is a Python library designed to optimize hyperparameters. Several HPO strategies are available in Osprey, including GS, RS, BOTPE (via Hyperopt), and BOGP (via GPyOpt).
6.13 FarHo
FARHO FARHO
is a hyperparameter optimization package that employs gradient-based algorithms with TensorFlow. FARHO contains a few gradient-based optimizers, like reverse hypergradient and forward hypergradient methods. This library is designed to provide access to gradient-based hyperparameter optimizers in TensorFlow, allowing deep learning model training and hyperparameter optimization on GPUs or other tensor-optimized computing environments.
6.14 Hyperband
Hyperband Hyperband is a Python package for tuning hyperparameters with Hyperband, a bandit-based approach. Similar to ’GridSearchCV’ and ’RandomizedSearchCV’ in scikit-learn, Hyperband provides a class named ’HyperbandSearchCV’ that can be combined with sklearn and used for HPO problems. In the ’HyperbandSearchCV’ method, cross-validation is used for evaluation.
6.15 DEAP
DEAP DEAP
is a novel evolutionary computation package for Python that contains several evolutionary algorithms, like GA and PSO. It integrates with parallelization mechanisms, like multiprocessing, and with machine learning packages, like sklearn.
6.16 TPOT
TPOT TPOT
is a Python tool for AutoML that uses genetic programming to optimize ML pipelines. TPOT is built on top of sklearn, so it is easy to apply TPOT to ML models. ’TPOTClassifier’ is its principal function, and several additional hyperparameters of GA must be set to fit specific problems.
6.17 Nevergrad
Nevergrad nevergrad is an open-source Python library that includes a wide range of optimizers, like fastGA and PSO. In ML, Nevergrad can be used to tune all types of hyperparameters, including discrete, continuous, and categorical hyperparameters, by choosing different optimizers.
7 Experiments
To summarize the content of Sections 3 to 6, a comprehensive overview of applying hyperparameter optimization techniques to ML models is shown in Table 2. It provides a summary of common ML algorithms, their hyperparameters, suitable optimization methods, and available Python libraries; thus, data analysts and researchers can look up this table and select suitable optimization algorithms as well as libraries for practical use.


ML Algorithm  Main HPs  Optional HPs  HPO methods  Libraries

Linear regression  –  –  –  –
Ridge & lasso  alpha  –  BO-GP  Skopt
Logistic regression  penalty, C, solver  –  BO-TPE, SMAC, GA, PSO  Hyperopt, SMAC, TPOT, Optunity
KNN  n_neighbors  weights, p  BOs, Hyperband  Skopt, Hyperopt, SMAC, Hyperband
SVM  C, kernel, epsilon (for SVR)  gamma, degree  BO-TPE, SMAC, GA, PSO  Hyperopt, SMAC, TPOT, Optunity
NB  alpha  –  BO-GP  Skopt
DT  criterion, max_depth, min_samples_split, min_samples_leaf, max_features  splitter  GA, PSO, BO-TPE, SMAC  TPOT, Optunity, Hyperopt, SMAC
RF & ET  n_estimators, max_depth, criterion, min_samples_split, min_samples_leaf, max_features  –  GA, PSO, BO-TPE, SMAC  TPOT, Optunity, Hyperopt, SMAC
XGBoost  n_estimators, max_depth, learning_rate, subsample, colsample_bytree  gamma, min_child_weight  GA, PSO, BO-TPE, SMAC  TPOT, Optunity, Hyperopt, SMAC
Voting  estimators, voting  weights  GS  sklearn
Bagging  base_estimator, n_estimators  max_samples, max_features  BOs  Skopt, Hyperopt, SMAC
AdaBoost  base_estimator, n_estimators, learning_rate  –  BOs  Skopt, Hyperopt, SMAC
Deep learning  number of hidden layers, units per layer, activation, optimizer, learning rate, dropout rate, epochs, batch size  momentum  PSO, GA, BO-TPE, SMAC  Optunity, TPOT, Hyperopt, SMAC
K-means  n_clusters  init, n_init  BOs, Hyperband  Skopt, Hyperopt, SMAC, Hyperband
Hierarchical clustering  n_clusters  linkage  BOs, Hyperband  Skopt, Hyperopt, SMAC, Hyperband
DBSCAN  eps, min_samples  –  BO-TPE, SMAC, GA, PSO  Hyperopt, SMAC, TPOT, Optunity
Gaussian mixture  n_components  covariance_type  BO-GP  Skopt
PCA  n_components  svd_solver  BOs, Hyperband  Skopt, Hyperopt, SMAC, Hyperband
LDA  n_components  solver, shrinkage  BOs, Hyperband  Skopt, Hyperopt, SMAC, Hyperband

To put theory into practice, several experiments have been conducted based on Table 2. This section presents experiments applying eight different HPO techniques to three common and representative ML algorithms on two benchmark datasets. In the first part of this section, the experimental setup and the main process of HPO are discussed. In the second part, the results of the different HPO methods are compared and analyzed. The sample code of the experiments has been published on GitHub to illustrate the process of applying hyperparameter optimization to ML models.
7.1 Experimental Setup
Based on the steps to optimize hyperparameters discussed in Section 2.2, several steps were completed before the actual optimization experiments.
Firstly, two standard benchmark datasets provided by the sklearn library sklearn , namely the Modified National Institute of Standards and Technology (MNIST) dataset and the Boston housing dataset, are selected as the benchmark datasets for evaluating the HPO methods on data analytics problems. MNIST is a handwritten digit recognition dataset used as a multi-class classification problem, while the Boston housing dataset contains information about house prices in various places in the city of Boston and is used as a regression dataset to predict housing prices.
At the next stage, the ML models and their objective functions need to be configured. In Section 5, common ML models are divided into five categories based on their hyperparameter types. Among those categories, "one discrete hyperparameter", "a few conditional hyperparameters", and "a large hyperparameter configuration space with multiple types of hyperparameters" are the three most common cases. Thus, three ML algorithms, KNN, SVM, and RF, are selected as the target models to be optimized, since their hyperparameter types represent the three most common HPO cases: KNN has one important hyperparameter, the number of considered nearest neighbors for each sample; SVM has a few conditional hyperparameters, like the kernel type and the penalty parameter C; RF has multiple hyperparameters of different types, as discussed in Section 3. Moreover, KNN, SVM, and RF can all be applied to both classification and regression problems.
In the next step, the performance metric and evaluation method are configured. For each experiment on the two selected datasets, 3-fold cross-validation is implemented to evaluate the involved HPO methods. The two most commonly-used performance metrics are used in our experiments. For classification models, accuracy is used as the performance metric, which is the proportion of correctly classified data; for regression models, the mean squared error (MSE) is used, which measures the average squared difference between the predicted and actual values. Additionally, the computational time (CT), the total time needed to complete a HPO process with 3-fold cross-validation, is used as the model efficiency metric IDSme . In each experiment, the optimal ML model architecture with the highest accuracy or the lowest MSE is returned, along with the optimal hyperparameter configuration.
After that, to fairly compare the different optimization algorithms and frameworks, certain constraints should be satisfied. Firstly, we compare different HPO methods using the same hyperparameter configuration space. For KNN, the only hyperparameter to be optimized, ’n_neighbors’, is set to the same range of 1 to 20 for each optimization method evaluation. The hyperparameters of the SVM and RF models for classification and regression problems are also set to the same configuration space for each type of problem. The specifics of the configuration space for the ML models are shown in Table 3. The selected hyperparameters and their search spaces are determined based on the concepts in Section 3, domain knowledge, and manual testing grid1 . The hyperparameter types of each ML algorithm are also summarized in Table 3.


ML Model  Hyperparameter  Type  Search Space 


RF Classifier  n_estimators  Discrete  [10,100] 
max_depth  Discrete  [5,50]  
min_samples_split  Discrete  [2,11]  
min_samples_leaf  Discrete  [1,11]  
criterion  Categorical  [’gini’, ’entropy’]  
max_features  Discrete  [1,64]  
SVM Classifier  C  Continuous  [0.1,50] 
kernel  Categorical  [’linear’, ’poly’, ’rbf’, ’sigmoid’]  
KNN Classifier  n_neighbors  Discrete  [1,20] 
RF Regressor  n_estimators  Discrete  [10,100] 
max_depth  Discrete  [5,50]  
min_samples_split  Discrete  [2,11]  
min_samples_leaf  Discrete  [1,11]  
criterion  Categorical  [’mse’, ’mae’]  
max_features  Discrete  [1,13]  
SVM Regressor  C  Continuous  [0.1,50] 
kernel  Categorical  [’linear’, ’poly’, ’rbf’, ’sigmoid’]  
epsilon  Continuous  [0.001,1]  
KNN Regressor  n_neighbors  Discrete  [1,20] 

On the other hand, to fairly compare the performance metrics of the optimization techniques, the maximum number of iterations for all HPO methods is set to 50 for RF and SVM model optimization, and 10 for KNN model optimization, based on manual testing and domain knowledge. Moreover, to avoid the impact of randomness, all experiments are repeated ten times with different random seeds; the results are averaged for regression problems or given a majority vote for classification problems.
In Section 4, more than ten HPO methods are introduced. In our experiments, eight representative HPO approaches are selected for performance comparison: GS, RS, BO-GP, BO-TPE, Hyperband, BOHB, GA, and PSO. After setting up a fair experimental environment for each HPO method, the HPO experiments are implemented based on the steps discussed in Section 2.2.
All experiments were conducted using Python 3.5 on a machine with a six-core i7-8700 processor and 16 gigabytes (GB) of memory. The ML and HPO algorithms involved are evaluated using multiple open-source Python libraries and frameworks introduced in Section 6, including sklearn sklearn , Skopt SKOPT , Hyperopt Hyperopt , Optunity Optunity , Hyperband Hyperband , BOHB BOHB , and TPOT TPOT .
7.2 Performance Comparison
The results of applying eight different HPO methods to the ML models are summarized in Tables 4 to 9. Tables 4 to 6 provide the performance of each optimization algorithm when applied to the RF, SVM, and KNN classifiers evaluated on the MNIST dataset after a complete optimization process, while Tables 7 to 9 show the performance of each HPO method when applied to the RF, SVM, and KNN regressors evaluated on the Boston-housing dataset. In the first step, each ML model is trained and evaluated with its default hyperparameter configuration to serve as a baseline. After that, each HPO algorithm is applied to the ML models to evaluate and compare their accuracies for classification problems, or their MSEs for regression problems, and their computational time (CT).


Table 4: Performance of the HPO methods applied to the RF classifier (MNIST dataset).

Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs            | 90.65        | 0.09
GS                     | 93.32        | 48.62
RS                     | 93.38        | 16.73
BO-GP                  | 93.38        | 20.60
BO-TPE                 | 93.88        | 12.58
Hyperband              | 93.38        | 8.89
BOHB                   | 93.38        | 9.45
GA                     | 93.83        | 19.19
PSO                    | 93.73        | 12.43



Table 5: Performance of the HPO methods applied to the SVM classifier (MNIST dataset).

Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs            | 97.05        | 0.29
GS                     | 97.44        | 32.90
RS                     | 97.35        | 12.48
BO-GP                  | 97.50        | 17.56
BO-TPE                 | 97.44        | 3.02
Hyperband              | 97.44        | 11.37
BOHB                   | 97.44        | 8.18
GA                     | 97.44        | 16.89
PSO                    | 97.44        | 8.33



Table 6: Performance of the HPO methods applied to the KNN classifier (MNIST dataset).

Optimization Algorithm | Accuracy (%) | CT (s)
Default HPs            | 96.27        | 0.24
GS                     | 96.22        | 7.86
RS                     | 96.33        | 6.44
BO-GP                  | 96.83        | 1.12
BO-TPE                 | 96.83        | 2.33
Hyperband              | 96.22        | 4.54
BOHB                   | 97.44        | 3.84
GA                     | 96.83        | 2.34
PSO                    | 96.83        | 1.73



Table 7: Performance of the HPO methods applied to the RF regressor (Boston-housing dataset).

Optimization Algorithm | MSE   | CT (s)
Default HPs            | 31.26 | 0.08
GS                     | 29.02 | 4.64
RS                     | 27.92 | 3.42
BO-GP                  | 26.79 | 17.94
BO-TPE                 | 25.42 | 1.53
Hyperband              | 26.14 | 2.56
BOHB                   | 25.56 | 1.88
GA                     | 26.95 | 4.73
PSO                    | 25.69 | 3.20



Table 8: Performance of the HPO methods applied to the SVM regressor (Boston-housing dataset).

Optimization Algorithm | MSE   | CT (s)
Default HPs            | 77.43 | 0.02
GS                     | 67.07 | 1.33
RS                     | 61.40 | 0.48
BO-GP                  | 61.27 | 5.87
BO-TPE                 | 59.40 | 0.33
Hyperband              | 73.44 | 0.32
BOHB                   | 59.67 | 0.31
GA                     | 60.17 | 1.12
PSO                    | 58.72 | 0.53



Table 9: Performance of the HPO methods applied to the KNN regressor (Boston-housing dataset).

Optimization Algorithm | MSE   | CT (s)
Default HPs            | 81.48 | 0.004
GS                     | 81.53 | 0.12
RS                     | 80.77 | 0.11
BO-GP                  | 80.77 | 0.49
BO-TPE                 | 80.83 | 0.08
Hyperband              | 80.87 | 0.10
BOHB                   | 80.77 | 0.09
GA                     | 80.77 | 0.33
PSO                    | 80.74 | 0.19

From Tables 4 to 9, we can see that using the default HP configurations does not yield the best model performance in our experiments, which emphasizes the importance of utilizing HPO methods. GS and RS can be seen as baselines for HPO problems. The results in Tables 4 to 9 show that the computational time of GS is often much higher than that of other optimization methods. With the same search space size, RS is faster than GS, but neither can guarantee detection of the near-optimal hyperparameter configurations of ML models, especially for the RF and SVM models, which have a larger search space than KNN.
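The contrast between GS and RS under an equal budget can be illustrated with a toy objective standing in for model accuracy; the objective function, the grid, and the bonus values below are invented purely for illustration:

```python
import itertools
import random

def toy_objective(c, kernel):
    """Stand-in for a validation score (higher is better), peaking at C=10."""
    kernel_bonus = {"linear": 0.0, "poly": 0.1, "rbf": 0.3, "sigmoid": -0.2}
    return -((c - 10.0) ** 2) / 100.0 + kernel_bonus[kernel]

kernels = ["linear", "poly", "rbf", "sigmoid"]

# Grid search: exhaustively evaluate a fixed grid (budget = 5 * 4 = 20).
grid_c = [0.1, 5, 10, 25, 50]
grid_best = max(itertools.product(grid_c, kernels),
                key=lambda ck: toy_objective(*ck))

# Random search: same budget of 20 evaluations, but C is sampled
# continuously, so it can land between the grid points.
rng = random.Random(42)
samples = [(rng.uniform(0.1, 50), rng.choice(kernels)) for _ in range(20)]
rs_best = max(samples, key=lambda ck: toy_objective(*ck))

print("GS best:", grid_best)
print("RS best:", rs_best)
```

Neither method models the objective: both simply evaluate candidate points, which is why neither can guarantee finding the near-optimal region in a large space.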
The performance of the BO and multi-fidelity models is much better than that of GS and RS. The computation time of BO-GP is often higher than that of other HPO methods due to its cubic time complexity, but it can obtain better performance metrics for ML models with a small continuous hyperparameter space, like KNN. Conversely, Hyperband is often unable to obtain the highest accuracy or the lowest MSE, but its computational time is low because it works on small-sized subsets. BO-TPE and BOHB often perform better than the other methods, since they can detect the optimal or near-optimal hyperparameter configurations within a short computational time.
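The subset-based budgeting that keeps Hyperband's computational time low is built on successive halving: many configurations are evaluated on a small budget, and only the best fraction survives to larger budgets. A minimal pure-Python sketch, where the noisy toy loss and all function names are our own illustrative assumptions:

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of configs at each rung, multiplying the budget."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Rank every surviving config on the current (cheap) budget.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta   # survivors earn a larger, more reliable budget
    return survivors[0]

# Toy loss: more budget means less noise; true quality is distance to 0.7.
rng = random.Random(0)
def toy_loss(config, budget):
    return abs(config - 0.7) + rng.gauss(0, 1.0 / budget)

candidates = [rng.random() for _ in range(27)]
best = successive_halving(candidates, toy_loss)
print("selected config:", best)
```

Full Hyperband wraps this loop in an outer loop over several (n_configs, min_budget) trade-offs, hedging against the risk that low-fidelity rankings are misleading.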
The metaheuristic methods, GA and PSO, often achieve higher accuracies than the other HPO methods for classification problems, and lower MSEs than the other optimization techniques for regression problems. However, their computational time is often higher than that of BO-TPE and the multi-fidelity models, especially for GA, which does not support parallel execution.
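A minimal PSO sketch for a single continuous hyperparameter, such as the SVM's C, with a toy objective standing in for validation error; the inertia and acceleration weights below are common defaults from the PSO literature, not values taken from this paper's experiments:

```python
import random

def pso_minimize(f, low, high, n_particles=10, iters=40, seed=0):
    """Minimal particle swarm optimization over one continuous dimension."""
    rng = random.Random(seed)
    pos = [rng.uniform(low, high) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                       # each particle's best position
    pbest_val = [f(x) for x in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]

    w, c1, c2 = 0.7, 1.5, 1.5            # inertia and acceleration weights
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], low), high)  # clip to bounds
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest, gbest_val

# Toy objective: pretend the validation error is minimized at C = 12.3.
best_c, best_err = pso_minimize(lambda c: (c - 12.3) ** 2, 0.1, 50)
print(best_c, best_err)
```

Because every particle's update in an iteration is independent given the current global best, PSO parallelizes naturally, unlike GA's sequential selection, crossover, and mutation steps.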
To summarize, GS and RS are simple to implement, but they often cannot detect the optimal hyperparameter configurations, or they require substantial computational time. BO-GP and GA also require more computational time than other HPO methods, but BO-GP works well on small configuration spaces, while GA is effective for large configuration spaces. Hyperband's computational time is low, but it cannot guarantee detection of the global optimum. For ML models with large configuration spaces, BO-TPE, BOHB, and PSO often work well.
8 Open Issues, Challenges, and Future Research Directions
Although there have been many existing HPO algorithms and practical frameworks, some issues still need to be addressed, and several aspects in this domain could be improved. In this section, we discuss the open challenges, current research questions, and potential research directions in the future. They can be classified as model complexity challenges and model performance challenges, which are summarized in Table 10.


Table 10: Open challenges and future research directions of HPO.

Category          | Challenge & Future Requirement         | Brief Description
Model complexity  | Costly objective function evaluations  | HPO methods should reduce evaluation time on large datasets.
                  | Complex search space                   | HPO methods should reduce execution time on high dimensionalities (large hyperparameter search spaces).
Model performance | Strong anytime performance             | HPO methods should be able to detect the optimal or near-optimal HPs even with a very limited budget.
                  | Strong final performance               | HPO methods should be able to detect the global optimum when given a sufficient budget.
                  | Comparability                          | There should be a standard set of benchmarks to fairly evaluate and compare different optimization algorithms.
                  | Overfitting and generalization         | The optimal HPs detected by HPO methods should generalize to build efficient models on unseen data.
                  | Randomness                             | HPO methods should reduce the randomness of the obtained results.
                  | Scalability                            | HPO methods should be scalable to multiple libraries or platforms (e.g., distributed ML platforms).
                  | Continuous updating capability         | HPO methods should consider their capacity to detect and update optimal HP combinations on continuously-updated data.

8.1 Model Complexity
8.1.1 Costly Objective Function Evaluations
To evaluate the performance of a ML model with different hyperparameter configurations, its objective function must be minimized in each evaluation. Depending on the scale of the data, the model complexity, and the available computational resources, the evaluation of each hyperparameter configuration may take several minutes, hours, days, or even longer HPS . Additionally, the values of certain hyperparameters have a direct impact on the execution time, like the number of considered neighbors in KNN, the number of base decision trees in RF, and the number of hidden layers in deep neural networks NNetime .
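One simple mitigation for costly evaluations is to cache objective values, so that a configuration an optimizer proposes twice is never retrained; the `make_cached_objective` wrapper and the toy training function below are hypothetical names for illustration:

```python
def make_cached_objective(evaluate):
    """Wrap a costly objective so repeated configs are never re-evaluated."""
    cache = {}
    calls = {"n": 0}
    def objective(**config):
        key = tuple(sorted(config.items()))   # hashable config fingerprint
        if key not in cache:
            calls["n"] += 1                   # an actual (costly) training run
            cache[key] = evaluate(**config)
        return cache[key]
    objective.training_runs = calls
    return objective

# Toy stand-in for training and validating a model:
def fake_train(n_neighbors):
    return abs(n_neighbors - 7) / 10.0        # pretend validation error

obj = make_cached_objective(fake_train)
for k in [3, 5, 3, 7, 5, 7, 3]:               # an optimizer revisiting configs
    obj(n_neighbors=k)
print("trainings actually run:", obj.training_runs["n"])  # 3, not 7
```

Caching helps most with discrete and categorical spaces, where optimizers frequently revisit identical configurations; for continuous spaces, surrogate models (as in BO) serve the analogous purpose.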
To address this problem, BO models reduce the total number of evaluations by spending time choosing the next point to evaluate instead of simply evaluating all possible hyperparameter configurations; however, they still require considerable execution time due to their poor capacity for parallelization. On the other hand, although multi-fidelity optimization methods, like Hyperband, have had some success dealing with HPO problems under limited budgets, some problems still cannot be solved effectively by HPO due to the complexity of the models or the scale of the datasets AMLB . For example, the ImageNet Imagenet challenge is a very popular problem in the image processing domain, but there has not yet been any work on efficiently optimizing hyperparameters for the ImageNet challenge, due to its huge scale and the complexity of the CNN models used on ImageNet.
8.1.2 Complex Search Space
In many problems to which ML algorithms are applied, only a few hyperparameters have significant effects on model performance, and they are the main hyperparameters that require tuning. However, certain other unimportant hyperparameters may still affect performance slightly and may be considered to optimize the ML model further, which increases the dimensionality of the hyperparameter search space. As the number of hyperparameters and their candidate values increases, the dimensionality of the search space and the complexity of the problem grow exponentially, and so does the total objective function evaluation time EHPO . Therefore, it is necessary to reduce the influence of large search spaces on execution time by improving existing HPO methods.
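The exponential growth is easy to quantify: a full grid over k hyperparameters with v candidate values each requires v^k evaluations. A short sketch, using our own illustrative discretization of the RF classifier's six hyperparameters from Table 3:

```python
from math import prod

def grid_size(values_per_hp):
    """Number of configurations a full grid search must evaluate."""
    return prod(values_per_hp)

# Five numeric RF hyperparameters discretized into 10 candidates each,
# plus 'criterion' with its 2 categories:
print(grid_size([10, 10, 10, 10, 2, 10]))      # 200000 evaluations

# Adding one more 10-valued hyperparameter multiplies the cost by 10:
print(grid_size([10, 10, 10, 10, 2, 10, 10]))  # 2000000 evaluations
```

At one minute per model training, the first grid alone would take roughly 139 days of sequential compute, which is why exhaustive search becomes infeasible as dimensionality grows.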
8.2 Model Performance
8.2.1 Strong Anytime Performance and Final Performance
HPO techniques are often expensive and sometimes require extreme resources, especially for massive datasets or complex ML models. Deep learning models are one resource-intensive example, since HPO methods view their objective function evaluations as black-box functions and do not consider their internal complexity. However, the overall budget is often very limited in most practical situations, so practical HPO algorithms should be able to prioritize objective function evaluations and have strong anytime performance, which indicates the capacity to detect optimal or near-optimal configurations even with a very limited budget BOHB . For instance, an efficient HPO method should have a high convergence speed, so that there is no huge difference between the results before and after model convergence, and should avoid the near-random results that methods like RS produce when time and resources are limited.
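Anytime performance can be assessed by recording the incumbent, i.e., the best result found so far, after every objective evaluation; an optimizer with strong anytime performance drives this trace down quickly. A minimal sketch (the loss values below are invented):

```python
def incumbent_trace(losses):
    """Best-so-far loss after each objective function evaluation."""
    trace, best = [], float("inf")
    for loss in losses:
        best = min(best, loss)
        trace.append(best)
    return trace

# Losses observed during a hypothetical optimization run:
run = [0.42, 0.35, 0.50, 0.31, 0.33, 0.29, 0.40]
print(incumbent_trace(run))   # [0.42, 0.35, 0.35, 0.31, 0.31, 0.29, 0.29]
```

Truncating such traces at any budget and comparing the incumbents gives a direct measurement of anytime performance, while the final entry measures final performance.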
On the other hand, if conditions permit and an adequate budget is given, HPO approaches should be able to identify the globally optimal hyperparameter configuration, termed strong final performance BOHB .
8.2.2 Comparability of HPO Methods
To optimize the hyperparameters of ML models, different optimization algorithms can be applied to each ML framework. Different optimization techniques have their own strengths and drawbacks in different cases, and currently, there is no single optimization approach that outperforms all other approaches when processing different datasets with various metrics and hyperparameter types AMLSC . In this paper, we have analyzed the strengths and weaknesses of common hyperparameter optimization techniques based on their principles and their performance in practical applications; but this topic could be extended more comprehensively.
To solve this problem, a standard set of benchmarks could be designed and agreed on by the community for a better comparison of different HPO algorithms. For example, the COCO platform (Comparing Continuous Optimizers) COCO provides benchmarks and analysis of common continuous optimizers. However, to date, no reliable platform provides benchmarks and analysis of all common hyperparameter optimization approaches. It would be easier to choose HPO algorithms for practical applications if a platform like COCO existed. In addition, a unified metric could also improve the comparability of different HPO algorithms, since different metrics are currently used in different practical problems AMLB .
On the other hand, based on the comparison of different HPO algorithms, a way to further improve HPO is to combine existing models or propose new models that contain as many benefits as possible and are more suitable for practical problems than existing singular models. For example, the BOHB method BOHB has had some success by combining Bayesian optimization and Hyperband. In addition, future research should consider both model performance and time budgets to develop HPO algorithms that suit reallife applications.
8.2.3 Overfitting and Generalization
Generalization is another issue with HPO models. Since hyperparameter configurations are evaluated with a finite number of evaluations on a dataset, the optimal hyperparameter values detected by HPO approaches might not be optimal on previously-unseen data. This is similar to the overfitting issue in ML models that occurs when a model fits a finite number of known data points closely but fits unseen data poorly overfitting . Generalization is also a common concern for multi-fidelity algorithms, like Hyperband and BOHB, since they need to extract subsets that represent the entire dataset.
One solution to reduce or avoid overfitting is to use cross-validation to identify a stable optimum that performs best in all or most of the subsets, instead of a sharp optimum that only performs well on a single validation set AMLB . However, cross-validation increases the execution time several-fold. It would be beneficial if future research developed methods that better handle overfitting and improve generalization.
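The k-fold scheme behind this solution can be sketched in pure Python; `evaluate` is a hypothetical function that trains on the training indices and scores on the held-out validation indices:

```python
def kfold_indices(n_samples, k):
    """Split sample indices into k (near-)equal contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_score(evaluate, n_samples, k=5):
    """Average a per-fold validation score over all k folds."""
    folds = kfold_indices(n_samples, k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on everything outside the held-out fold.
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(evaluate(train_idx, val_idx))
    return sum(scores) / k

print(kfold_indices(10, 3))   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Scoring each candidate configuration with `cv_score` instead of a single split multiplies the evaluation cost by k, which is exactly the several-fold slowdown noted above, but it rewards configurations that are good across all folds.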
8.2.4 Randomness
There are stochastic components in the objective functions of ML algorithms; thus, in some cases, the optimal hyperparameter configuration might differ after each run. This randomness could be due to various procedures of certain ML models, like neural network initialization or the sampled subsets of a bagging model HPS , or due to certain procedures of HPO algorithms, like the crossover and mutation operations in GA. In addition, it is often difficult for HPO methods to identify the global optimum, since HPO problems are mainly NP-hard. Many existing HPO algorithms can only collect several different near-optimal values across runs, which is caused by this randomness. Thus, existing HPO models could be further improved to reduce the impact of randomness. One possible solution is to run a HPO method multiple times and select the hyperparameter value that occurs most often as the final optimum.
8.2.5 Scalability
In practice, one main limitation of many existing HPO frameworks is that they are tightly integrated with one or a few machine learning libraries, like sklearn and Keras, which restricts them to working on a single node rather than on large data volumes AMLSC . To tackle large datasets, some distributed machine learning platforms, like Apache SystemML SystemML and Spark MLlib MLib , have been developed; however, only very few HPO frameworks support distributed ML. Therefore, more research effort should go into scalable HPO frameworks, like ones supporting distributed ML platforms, and into supporting more libraries.
On the other hand, future practical HPO algorithms should have the scalability to efficiently optimize hyperparameter spaces from small to large sizes, irrespective of whether the hyperparameters are continuous, discrete, categorical, or conditional.
8.2.6 Continuous Updating Capability
In practice, many datasets are not stationary: they are constantly updated by adding new data and deleting old data. Correspondingly, the optimal hyperparameter values or combinations may also change as the data changes. Currently, developing HPO methods with the capacity to continuously tune hyperparameter values as data changes has not drawn much attention, since researchers and data analysts often do not alter the ML model after achieving a currently optimal performance AMLSC . However, since the optimal hyperparameter values change with the data, proper approaches should be proposed to achieve this continuous updating capability.
9 Conclusion
Machine learning has become the primary strategy for tackling data-related problems and has been widely used in various applications. To apply ML models to practical problems, their hyperparameters need to be tuned to fit specific datasets. However, since the scale of the data produced in real life has greatly increased, and manually tuning hyperparameters is extremely computationally expensive, it has become crucial to optimize hyperparameters through an automatic process. In this survey paper, we have comprehensively discussed the state-of-the-art research in the domain of hyperparameter optimization, as well as how to apply these techniques to different ML models through theory and practical experiments. When applying optimization methods to ML models, the hyperparameter types in a ML model are the main concern for HPO method selection. To summarize, BOHB is the recommended choice for optimizing a ML model if randomly selected subsets are highly representative of the given dataset, since it can efficiently optimize all types of hyperparameters; otherwise, BO models are recommended for small hyperparameter configuration spaces, while PSO is usually the best choice for large configuration spaces. Moreover, some existing useful HPO tools and frameworks, open challenges, and potential research directions are also provided and highlighted for practical use and future research purposes. We hope that our survey serves as a useful resource for ML users, developers, data analysts, and researchers to use and tune ML models with proper HPO techniques and frameworks. We also hope that it helps to enhance understanding of the challenges that still exist within the HPO domain, and thereby advances HPO and ML applications in future research.
References
 (1) M.I. Jordan, T.M. Mitchell, Machine learning: Trends, perspectives, and prospects, Science 349 (2015) 255–260. https://doi.org/10.1126/science.aaa8415.
 (2) M.A. Zöller and M. F. Huber, Benchmark and Survey of Automated Machine Learning Frameworks, arXiv preprint arXiv:1904.12054, (2019). https://arxiv.org/abs/1904.12054.
 (3) R. E. Shawi, M. Maher, S. Sakr, Automated machine learning: Stateoftheart and open challenges, arXiv preprint arXiv:1906.02287, (2019). http://arxiv.org/abs/1906.02287.
 (4) M. Kuhn and K. Johnson, Applied Predictive Modeling, Springer (2013) ISBN: 9781461468493.
 (5) G.I. Diaz, A. FokoueNkoutche, G. Nannicini, H. Samulowitz, An effective algorithm for hyperparameter optimization of neural networks, IBM J. Res. Dev. 61 (2017) 1–20. https://doi.org/10.1147/JRD.2017.2709578.
 (6) F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automatic Machine Learning: Methods, Systems, Challenges, Springer (2019) ISBN: 9783030053185.
 (7) N. DecastroGarcía, Á. L. Muñoz Castañeda, D. Escudero García, and M. V. Carriegos, Effect of the Sampling of a Dataset in the Hyperparameter Optimization Phase over the Efficiency of a Machine Learning Algorithm, Complexity 2019 (2019). https://doi.org/10.1155/2019/6278908.
 (8) S. Abreu, Automated Architecture Design for Deep Neural Networks, arXiv preprint arXiv:1908.10714, (2019). http://arxiv.org/abs/1908.10714.
 (9) O. S. Steinholtz, A Comparative Study of Blackbox Optimization Algorithms for Tuning of Hyperparameters in Deep Neural Networks, M.S. thesis, Dept. Elect. Eng., Luleå Univ. Technol., (2018).
 (10) G. Luo, A review of automatic selection methods for machine learning algorithms and hyperparameter values, Netw. Model. Anal. Heal. Informatics Bioinforma. 5 (2016) 1–16. https://doi.org/10.1007/s1372101601256.
 (11) D. Maclaurin, D. Duvenaud, R.P. Adams, Gradientbased Hyperparameter Optimization through Reversible Learning, arXiv preprint arXiv:1502.03492, (2015). http://arxiv.org/abs/1502.03492.
 (12) J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, Algorithms for hyperparameter optimization, Proc. Adv. Neural Inf. Process. Syst., (2011) 2546–2554.
 (13) J. Bergstra, Y. Bengio, Random Search for Hyper-Parameter Optimization, J. Mach. Learn. Res. 13 (1) (2012) 281–305.
 (14) K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, K. LeytonBrown, Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters, BayesOpt Work. (2013) 1–5.
 (15) K. Eggensperger, F. Hutter, H.H. Hoos, K. LeytonBrown, Efficient benchmarking of hyperparameter optimizers via surrogates, Proc. Natl. Conf. Artif. Intell. 2 (2015) 1114–1120.
 (16) L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyperband: A novel bandit-based approach to hyperparameter optimization, J. Mach. Learn. Res. 18 (2018) 1–52.
 (17) Q. Yao et al., Taking Human out of Learning Applications: A Survey on Automated Machine Learning, arXiv preprint arXiv:1810.13306, (2018). http://arxiv.org/abs/1810.13306.
 (18) S. Lessmann, R. Stahlbock, S.F. Crone, Optimizing hyperparameters of support vector machines by genetic algorithms, Proc. 2005 Int. Conf. Artif. Intell. ICAI’05. 1 (2005) 74–80.
 (19) P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. R. Paster, Particle swarm optimization for hyperparameter selection in deep neural networks, Proc. ACM Int. Conf. Genet. Evol. Comput., (2017) 481–488.
 (20) S. Sun, Z. Cao, H. Zhu, J. Zhao, A Survey of Optimization Methods from a Machine Learning Perspective, arXiv preprint arXiv:1906.06821, (2019). https://arxiv.org/abs/1906.06821.
 (21) T.M. S. Bradley, A. Hax, Applied Mathematical Programming, AddisonWesley, Reading, Massachusetts. (1977).
 (22) S. Bubeck, Convex optimization: Algorithms and complexity, Found. Trends Mach. Learn. 8 (2015) 231–357. https://doi.org/10.1561/2200000050.
 (23) B. Shahriari, A. BouchardCôté, and N. de Freitas, “Unbounded Bayesian optimization via regularization,” Proc. Artif. Intell. Statist., (2016) 1168–1176.
 (24) G.I. Diaz, A. FokoueNkoutche, G. Nannicini, H. Samulowitz, An effective algorithm for hyperparameter optimization of neural networks, IBM J. Res. Dev. 61 (2017) 1–20. https://doi.org/10.1147/JRD.2017.2709578.
 (25) C. Gambella, B. Ghaddar, and J. NaoumSawaya, Optimization Models for Machine Learning: A Survey, arXiv preprint arXiv:1901.05331, (2019). http://arxiv.org/abs/1901.05331.
 (26) E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska, Automating model search for large scale machine learning, Proc. 6th ACM Symp. Cloud Comput., (2015) 368–380.
 (27) J. Nocedal and S. Wright, Numerical Optimization, (2006) SpringerVerlag, ISBN: 9780387400655.
 (28) A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, DNS TypoSquatting Domain Detection: A Data Analytics & Machine Learning Based Approach, 2018 IEEE Glob. Commun. Conf. GLOBECOM 2018  Proc. (2018). https://doi.org/10.1109/GLOCOM.2018.8647679.
 (29) R. Caruana, A. NiculescuMizil, An empirical comparison of supervised learning algorithms, ACM Int. Conf. Proceeding Ser. 148 (2006) 161–168. https://doi.org/10.1145/1143844.1143865.
 (30) O. Kramer, ScikitLearn, in Machine Learning for Evolution Strategies. Cham, Switzerland: Springer International Publishing, (2016) 45–53.
 (31) F. Pedregosa et al., Scikitlearn: Machine learning in Python, J. Mach. Learn. Res., 12 (2011) 2825–2830.
 (32) T.Chen, C.Guestrin, XGBoost: a scalable tree boosting system, arXiv preprint arXiv:1603.02754, (2016). http://arxiv.org/abs/1603.02754.
 (33) F. Chollet, Keras, 2015. https://github.com/fchollet/keras.
 (34) C. Gambella, B. Ghaddar, J. Naoum-Sawaya, Optimization Models for Machine Learning: A Survey, (2019) 1–40. http://arxiv.org/abs/1901.05331.
 (35) C.M. Bishop, Pattern Recognition and Machine Learning, Springer (2006) ISBN: 9780387310732.
 (36) A.E. Hoerl, R.W. Kennard, Ridge Regression: Applications to Nonorthogonal Problems, Technometrics. 12 (1970) 69–82. https://doi.org/10.1080/00401706.1970.10488635.
 (37) L.E. Melkumova, S.Y. Shatskikh, Comparing Ridge and LASSO estimators for data analysis, Procedia Eng. 201 (2017) 746–755. https://doi.org/10.1016/j.proeng.2017.09.615.
 (38) R. Tibshirani, Regression Shrinkage and Selection Via the Lasso, J. R. Stat. Soc. Ser. B. 58 (1996) 267–288. https://doi.org/10.1111/j.25176161.1996.tb02080.x.
 (39) D.W. Hosmer Jr, S. Lemeshow, Applied logistic regression, Technometrics, 34 (1) (2013), 358359.
 (40) J.O. Ogutu, T. SchulzStreeck, H.P. Piepho, Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions, BMC Proceedings. BioMed Cent. 6 (2012).
 (41) J.M. Keller, M.R. Gray, A Fuzzy KNearest Neighbor Algorithm, IEEE Trans. Syst. Man Cybern. SMC15 (1985) 580–585. https://doi.org/10.1109/TSMC.1985.6313426.
 (42) W. Zuo, D. Zhang, K. Wang, On kernel differenceweighted knearest neighbor classification, Pattern Anal. Appl. 11 (2008) 247–257. https://doi.org/10.1007/s100440070100z.
 (43) A. Smola, V. Vapnik, Support vector regression machines, Adv. Neural Inf. Process. Syst. 9 (1997) 155161.
 (44) L. Yang, R. Muresan, A. AlDweik, L.J. Hadjileontiadis, ImageBased Visibility Estimation Algorithm for Intelligent Transportation Systems, IEEE Access. 6 (2018) 76728–76740. https://doi.org/10.1109/ACCESS.2018.2884225.
 (45) L. Yang, Comprehensive Visibility Indicator Algorithm for Adaptable Speed Limit Control in Intelligent Transportation Systems, M.A.Sc. thesis, University of Guelph, 2018.
 (46) O.S. Soliman, A.S. Mahmoud, A classification system for remote sensing satellite images using support vector machine with nonlinear kernel functions, 2012 8th Int. Conf. Informatics Syst. INFOS 2012. (2012) BIO181BIO187.

 (47) I. Rish, An empirical study of the naive Bayes classifier, IJCAI 2001 Work. Empir. Methods Artif. Intell., (2001) 41–46.
 (48) J.N. Sulzmann, J. Fürnkranz, E. Hüllermeier, On pairwise naive bayes classifiers, Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 4701 LNAI (2007) 371–381. https://doi.org/10.1007/9783540749585_35.
 (49) C. Bustamante, L. Garrido, R. Soto, Comparing fuzzy Naive Bayes and Gaussian Naive Bayes for decision making in RoboCup 3D, Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 4293 LNAI (2006) 237–247. https://doi.org/10.1007/11925231_23.
 (50) A.M. Kibriya, E. Frank, B. Pfahringer, G. Holmes, Multinomial naive bayes for text categorization revisited, Lect. Notes Artif. Intell. (Subseries Lect. Notes Comput. Sci. 3339 (2004) 488–499.
 (51) J.D.M. Rennie, L. Shih, J. Teevan, D.R. Karger Tackling the poor assumptions of Naive Bayes text classifiers, Proc. Twent. Int. Conf. Mach. Learn. ICML (2003), 616623.
 (52) V. Narayanan, I. Arora, and A. Bhatia, Fast and accurate sentiment classification using an enhanced naïve Bayes model, arXiv preprint arXiv:1305.6143, (2013). https://arxiv.org/abs/1305.6143.
 (53) S. Rasoul, L. David, A Survey of Decision Tree Classifier Methodology, IEEE Trans. Syst. Man. Cybern. 21 (1991) 660–674.
 (54) D.M. Manias, M. Jammal, H. Hawilo, A. Shami, P. Heidari, A. Larabi, R. Brunner, Machine Learning for Performanceaware Virtual Network Function Placement, 2019 IEEE Glob. Commun. Conf. GLOBECOM 2019  Proc. (2019) 12–17. https://doi.org/10.1109/GLOBECOM38437.2019.9013246.
 (55) L. Yang, A. Moubayed, I. Hamieh, A. Shami, Treebased intelligent intrusion detection system in internet of vehicles, 2019 IEEE Glob. Commun. Conf. GLOBECOM 2019  Proc. (2019). https://doi.org/10.1109/GLOBECOM38437.2019.9013892.
 (56) S. Sanders, C. GiraudCarrier, Informing the use of hyperparameter optimization through metalearning, Proc.  IEEE Int. Conf. Data Mining, ICDM. 2017Novem (2017) 1051–1056. https://doi.org/10.1109/ICDM.2017.137.

 (57) M. Injadat, F. Salo, A.B. Nassif, A. Essex, A. Shami, Bayesian Optimization with Machine Learning Algorithms Towards Anomaly Detection, 2018 IEEE Glob. Commun. Conf. (2018) 1–6. https://doi.org/10.1109/glocom.2018.8647714.
 (58) F. Salo, M.N. Injadat, A. Moubayed, A.B. Nassif, A. Essex, Clustering Enabled Classification using Ensemble Feature Selection for Intrusion Detection, 2019 Int. Conf. Comput. Netw. Commun. ICNC 2019. (2019) 276–281. https://doi.org/10.1109/ICCNC.2019.8685636.
 (59) K. Arjunan, C.N. Modi, An enhanced intrusion detection framework for securing network layer of cloud computing, ISEA Asia Secur. Priv. Conf. 2017, ISEASP 2017. (2017) 1–10. https://doi.org/10.1109/ISEASP.2017.7976988.
 (60) Y. Xia, C. Liu, Y.Y. Li, N. Liu, A boosted decision tree approach using Bayesian hyperparameter optimization for credit scoring, Expert Syst. Appl. 78 (2017) 225–241. https://doi.org/10.1016/j.eswa.2017.02.017.
 (61) T. G. Dietterich, Ensemble methods in machine learning, Mult. Classif. Syst., 1857 (2000), 115.
 (62) A. Moubayed, E. Aqeeli, A. Shami, Ensemblebased Feature Selection and Classification Model for DNS Typosquatting Detection, in: 2020 IEEE Can. Conf. Electr. Comput. Eng., 2020.
 (63) W. Yin, K. Kann, M. Yu, and H. Schütze, Comparative Study of CNN and RNN for Natural Language Processing, arXiv preprint arXiv:1702.01923, (2017). https://arxiv.org/abs/1702.01923.
 (64) A. Koutsoukas, K.J. Monaghan, X. Li, J. Huan, Deeplearning: Investigating deep neural networks hyperparameters and comparison of performance to shallow methods for modeling bioactivity data, J. Cheminform. 9 (2017) 1–13. https://doi.org/10.1186/s133210170226y.
 (65) T. Domhan, J.T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, IJCAI Int. Jt. Conf. Artif. Intell. 2015January (2015) 3460–3468.
 (66) Y. Ozaki, M. Yano, M. Onishi, Effective hyperparameter optimization using NelderMead method in deep learning, IPSJ Trans. Comput. Vis. Appl. 9 (2017). https://doi.org/10.1186/s4107401700307.
 (67) F.C. Soon, H.Y. Khaw, J.H. Chuah, J. Kanesan, Hyperparameters optimisation of deep CNN architecture for vehicle logo recognition, IET Intell. Transp. Syst. 12 (2018) 939–946. https://doi.org/10.1049/ietits.2018.5127.
 (68) D. Han, Q. Liu, W. Fan, A new image classification method using CNN transfer learning and web data augmentation, Expert Syst. Appl. 95 (2018) 43–56. https://doi.org/10.1016/j.eswa.2017.11.028.
 (69) C. Di Francescomarino, M. Dumas, M. Federici, C. Ghidini, F.M. Maggi, W. Rizzi, L. Simonetto, Genetic algorithms for hyperparameter optimization in predictive business process monitoring, Inf. Syst. 74 (2018) 67–83. https://doi.org/10.1016/j.is.2018.01.003.
 (70) A. Moubayed, M. Injadat, A. Shami, H. Lutfiyya, Student Engagement Level in eLearning Environment: Clustering Using Kmeans, Am. J. Distance Educ. 34 (2020) 1–20. https://doi.org/10.1080/08923647.2020.1696140.
 (71) T. K. Moon, The expectationmaximization algorithm, IEEE Signal Process. Mag. 13 (6) (1996) 47–60.
 (72) S. BrahimBelhouari, A. Bermak, M. Shi, P.C.H. Chan, Fast and Robust gas identification system using an integrated gas sensor technology and Gaussian mixture models, IEEE Sens. J. 5 (2005) 1433–1444. https://doi.org/10.1109/JSEN.2005.858926.
 (73) Y. Zhao, G. Karypis, Hierarchical Clustering Algorithms for Document Datasets, Data Min. Knowl. Discov. 10 (2005) 141–168.
 (74) K. Khan, S.U. Rehman, K. Aziz, S. Fong, S. Sarasvady, A. Vishwa, DBSCAN: Past, present and future, 5th Int. Conf. Appl. Digit. Inf. Web Technol. ICADIWT 2014. (2014) 232–238. https://doi.org/10.1109/ICADIWT.2014.6814687.
 (75) H. Zhou, P. Wang, H. Li, Research on adaptive parameters determination in DBSCAN algorithm, J. Inf. Comput. Sci. 9 (2012) 1967–1973.
 (76) J. Shlens, A Tutorial on Principal Component Analysis, arXiv preprint arXiv:1404.1100, (2014). https://arxiv.org/abs/1404.1100.
 (77) N. Halko, P. Martinsson, J. Tropp, Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev. 53 (2) (2011), pp. 217288
 (78) M. Loog, Conditional linear discriminant analysis, Proc.  Int. Conf. Pattern Recognit. 2 (2006) 387–390. https://doi.org/10.1109/ICPR.2006.402.

 (79) P. Howland, J. Wang, H. Park, Solving the small sample size problem in face recognition using generalized discriminant analysis, Pattern Recognit. 39 (2006) 277–287. https://doi.org/10.1016/j.patcog.2005.06.013.
 (80) I. Ilievski, T. Akhtar, J. Feng, C.A. Shoemaker, Efficient hyperparameter optimization of deep learning algorithms using deterministic RBF surrogates, 31st AAAI Conf. Artif. Intell. AAAI 2017. (2017) 822–829.
 (81) M.N. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Systematic Ensemble Model Selection Approach for Educational Data Mining, KnowledgeBased Syst. 200 (2020) 105992. https://doi.org/10.1016/j.knosys.2020.105992.
 (82) M. Injadat, A. Moubayed, A.B. Nassif, A. Shami, Multisplit Optimized Bagging Ensemble Model Selection for Multiclass Educational Data Mining, Springer’s Appl. Intell. (2020).
 (83) M. Claesen, J. Simm, D. Popovic, Y. Moreau, and B. De Moor, Easy Hyperparameter Search Using Optunity, arXiv preprint arXiv:1412.1114, (2014). https://arxiv.org/abs/1412.1114.
 (84) C. Witt, Worst-case and average-case approximations by simple randomized search heuristics, in: Proceedings of the 22nd Annual Symposium on Theoretical Aspects of Computer Science, STACS'05, Stuttgart, Germany, 2005, pp. 44–56.
 (85) Y. Bengio, Gradient-based optimization of hyperparameters, Neural Comput. 12 (8) (2000) 1889–1900.
 (86) H.H. Yang and S.I. Amari, Complexity issues in natural gradient descent method for training multilayer perceptrons, Neural Comput. 10 (8) (1998) 2137–2157.
 (87) J. Snoek, H. Larochelle, R. Adams, Practical Bayesian optimization of machine learning algorithms, Adv. Neural Inf. Process. Syst. 25 (2012) 2951–2959.
 (88) E. Hazan, A. Klivans, and Y. Yuan, Hyperparameter optimization: A spectral approach, arXiv preprint arXiv:1706.00764, (2017). https://arxiv.org/abs/1706.00764.
 (89) M. Seeger, Gaussian processes for machine learning, Int. J. Neural Syst. 14 (2004) 69–106.
 (90) F. Hutter, H.H. Hoos, and K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, Proc. LION 5 (2011) 507–523.
 (91) I. Dewancker, M. McCourt, S. Clark, Bayesian Optimization Primer, (2015) URL: https://sigopt.com/static/pdf/SigOpt Bayesian Optimization Primer.pdf
 (92) J. Hensman, N. Fusi, and N. D. Lawrence, Gaussian processes for big data, arXiv preprint arXiv:1309.6835, (2013). https://arxiv.org/abs/1309.6835.
 (93) M. Claesen and B. De Moor, Hyperparameter Search in Machine Learning, arXiv preprint arXiv:1502.02127, (2015). https://arxiv.org/abs/1502.02127.
 (94) L. Bottou, Large-scale machine learning with stochastic gradient descent, Proceedings of COMPSTAT, Springer (2010) 177–186.
 (95) S. Zhang, J. Xu, E. Huang, C.H. Chen, A new optimal sampling rule for multi-fidelity optimization via ordinal transformation, IEEE Int. Conf. Autom. Sci. Eng. (2016) 670–674. https://doi.org/10.1109/COASE.2016.7743467.
 (96) Z. Karnin, T. Koren, O. Somekh, Almost optimal exploration in multi-armed bandits, 30th Int. Conf. Mach. Learn. ICML 2013, 28 (2013) 2275–2283.
 (97) S. Falkner, A. Klein, F. Hutter, BOHB: Robust and Efficient Hyperparameter Optimization at Scale, 35th Int. Conf. Mach. Learn. ICML 2018. 4 (2018) 2323–2341.
 (98) A. Gogna, A. Tayal, Metaheuristics: Review and application, J. Exp. Theor. Artif. Intell. 25 (2013) 503–526. https://doi.org/10.1080/0952813X.2013.782347.
 (99) F. Itano, M.A. De Abreu De Sousa, E. Del-Moral-Hernandez, Extending MLP ANN hyperparameters optimization by using genetic algorithm, Proc. Int. Jt. Conf. Neural Networks (2018) 1–8. https://doi.org/10.1109/IJCNN.2018.8489520.
 (100) B. Kazimipour, X. Li, A.K. Qin, A Review of Population Initialization Techniques for Evolutionary Algorithms, 2014 IEEE Congr. Evol. Comput. (2014) 2585–2592. https://doi.org/10.1109/CEC.2014.6900618.
 (101) S. Rahnamayan, H.R. Tizhoosh, M.M.A. Salama, A novel population initialization method for accelerating evolutionary algorithms, Comput. Math. with Appl. 53 (2007) 1605–1614. https://doi.org/10.1016/j.camwa.2006.07.013.
 (102) F.G. Lobo, D.E. Goldberg, and M. Pelikan, Time complexity of genetic algorithms on exponentially scaled problems, Proc. Genet. Evol. Comput. Conf. (2000) 151–158.
 (103) Y. Shi, R.C. Eberhart, Parameter selection in particle swarm optimization, Evolutionary Programming VII, Springer (1998) 591–600.
 (104) X. Yan, F. He, Y. Chen, A novel hardware/software partitioning method based on position disturbed particle swarm optimization with invasive weed optimization, 32 (2017) 340–355. https://doi.org/10.1007/s11390-017-1714-2.
 (105) M.Y. Cheng, K.Y. Huang, M. Hutomo, Multiobjective dynamic-guiding PSO for optimizing work shift schedules, J. Constr. Eng. Manag. 144 (2018) 1–7. https://doi.org/10.1061/(ASCE)CO.1943-7862.0001548.
 (106) H. Wang, Z. Wu, J. Wang, X. Dong, S. Yu, G. Chen, A new population initialization method based on space transformation search, 5th Int. Conf. Nat. Comput. ICNC 2009. 5 (2009) 332–336. https://doi.org/10.1109/ICNC.2009.371.
 (107) J. Wang, J. Xu, and X. Wang, Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning, arXiv preprint arXiv:1801.01596, (2018). https://arxiv.org/abs/1801.01596.
 (108) P. Cazzaniga, M.S. Nobile, D. Besozzi, The impact of particles initialization in PSO: Parameter estimation as a case in point, 2015 IEEE Conf. Comput. Intell. Bioinforma. Comput. Biol. CIBCB 2015. (2015) 1–8. https://doi.org/10.1109/CIBCB.2015.7300288.
 (109) R. Martinez-Cantin, BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits, J. Mach. Learn. Res. 15 (2015) 3735–3739.
 (110) J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, D.D. Cox, Hyperopt: A Python library for model selection and hyperparameter optimization, Comput. Sci. Discov. 8 (2015). https://doi.org/10.1088/1749-4699/8/1/014008.
 (111) B. Komer, J. Bergstra, and C. Eliasmith, Hyperopt-sklearn: Automatic hyperparameter configuration for scikit-learn, Proc. ICML Workshop AutoML (2014) 34–40.
 (112) M. Pumperla, Hyperas, 2019. http://maxpumperla.com/hyperas/.
 (113) M. Lindauer, K. Eggensperger, M. Feurer, S. Falkner, A. Biedenkapp, and F. Hutter, Smac v3: Algorithm configuration in python, 2017. https://github.com/automl/SMAC3.
 (114) T. Head, MechCoder, G. Louppe, et al., scikit-optimize/scikit-optimize: v0.5.2, 2018. https://doi.org/10.5281/zenodo.1207017.
 (115) N. Knudde, J. van der Herten, T. Dhaene, and I. Couckuyt, GPflowOpt: A Bayesian Optimization Library using TensorFlow, arXiv preprint arXiv:1711.03845, (2017). https://arxiv.org/abs/1711.03845.
 (116) Autonomio Talos [Computer software], 2019. http://github.com/autonomio/talos.
 (117) L. Hertel, P. Sadowski, J. Collado, P. Baldi, Sherpa: Hyperparameter Optimization for Machine Learning Models, Conf. Neural Inf. Process. Syst. (2018).
 (118) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, arXiv preprint arXiv:1603.04467, (2016). https://arxiv.org/abs/1603.04467.
 (119) R.T. McGibbon, C.X. Hernández, M.P. Harrigan, et al., Osprey: Hyperparameter optimization for machine learning, J. Open Source Softw. 1 (2016) 34. https://doi.org/10.21105/joss.00034.
 (120) L. Franceschi, M. Donini, P. Frasconi, and M. Pontil, Forward and reverse gradient-based hyperparameter optimization, 34th Int. Conf. Mach. Learn. ICML 2017, 70 (2017) 1165–1173.
 (121) F.A. Fortin, F.M. De Rainville, M.A. Gardner, M. Parizeau, C. Gagné, DEAP: Evolutionary algorithms made easy, J. Mach. Learn. Res. 13 (2012) 2171–2175.
 (122) R.S. Olson and J.H. Moore, TPOT: A tree-based pipeline optimization tool for automating machine learning, Autom. Mach. Learn., Springer (2019) 151–160. https://doi.org/10.1007/978-3-030-05318-5_8.
 (123) J. Rapin and O. Teytaud, Nevergrad: A gradient-free optimization platform, 2018. https://github.com/FacebookResearch/Nevergrad.
 (124) L. Yang and A. Shami, Hyperparameter Optimization of Machine Learning Algorithms, 2020. https://github.com/LiYangHart/Hyperparameter-Optimization-of-Machine-Learning-Algorithms.
 (125) C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press (1995).
 (126) A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012) 1097–1105.
 (127) N. Hansen, A. Auger, O. Mersmann, T. Tusar, and D. Brockhoff, COCO: A Platform for Comparing Continuous Optimizers in a Black-Box Setting, arXiv preprint arXiv:1603.08785, (2016). https://arxiv.org/abs/1603.08785.
 (128) G.C. Cawley, N.L.C. Talbot, On overfitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res. 11 (2010) 2079–2107.
 (129) M. Boehm, A. Surve, S. Tatikonda, et al., SystemML: declarative machine learning on spark, Proc. VLDB Endow. 9 (2016) 1425–1436. https://doi.org/10.14778/3007263.3007279.
 (130) X. Meng, J. Bradley, B. Yavuz, et al., MLlib: Machine learning in Apache Spark, J. Mach. Learn. Res. 17 (1) (2016) 1235–1241.