Multi-level CNN for lung nodule classification with Gaussian Process assisted hyperparameter optimization

01/02/2019, by Miao Zhang et al.

This paper investigates lung nodule classification with deep neural networks (DNNs). Hyperparameter optimization in DNNs is a computationally expensive problem, where evaluating a single hyperparameter configuration may take several hours or even days. Bayesian optimization has recently been introduced for automatically searching for optimal hyperparameter configurations of DNNs. It applies probabilistic surrogate models, such as Gaussian processes, to approximate the validation error as a function of the hyperparameter configuration, and reduces the computational cost to a large extent. However, most existing surrogate models adopt stationary covariance functions, which measure the difference between hyperparameter points based only on their spatial distance, without considering their spatial locations. This distance-based assumption, together with the condition of constant smoothness throughout the whole hyperparameter search space, clearly violates the property that points far away from the optimal point usually achieve similarly poor performance even when they are far apart from each other. In this paper, a non-stationary kernel is proposed that allows the surrogate model to adapt to functions whose smoothness varies with the spatial location of inputs, and a multi-level convolutional neural network (ML-CNN) is built for lung nodule classification, whose hyperparameter configuration is optimized using the proposed non-stationary kernel based Gaussian surrogate model. Our algorithm searches the surrogate for the optimal setting via a hyperparameter importance based evolutionary strategy, and the experiments demonstrate that it outperforms manual tuning and well-established hyperparameter optimization methods, such as Random search, Gaussian processes with stationary kernels, and the recently proposed Hyperparameter Optimization via RBF and Dynamic coordinate search.

1 Introduction

Lung cancer is a notoriously aggressive cancer, with sufferers having an average 5-year survival rate of 18% and a mean survival time of less than 12 months Siegel2017Colorectal, and early diagnosis is very important for improving the survival rate. Recently, deep learning has shown its superiority in computer vision 6909475 ; Krizhevsky2012ImageNet ; Naiyan2015Transferring, and more researchers are applying deep neural networks to lung cancer diagnosis to support early detection in Computer Aided Diagnosis (CAD) systems Anthimopoulos2016Lung ; Golan2016Lung ; Song2017Using. In our previous work lyvjuan, a multi-level convolutional neural network (ML-CNN) was proposed for lung nodule malignancy classification, which extracts multi-scale features through different convolutional kernel sizes. Our ML-CNN achieves state-of-the-art accuracies in both binary and ternary classification (92.21% and 84.81%, respectively) without any preprocessing. However, the experiments also demonstrate that the performance is very sensitive to the hyperparameter configuration, especially the number of feature maps in every convolutional layer, and the near-optimal hyperparameter configuration was obtained through trial and error.

Automatic hyperparameter optimization is crucial for applying deep learning algorithms in practice, and several methods, including Grid search Lecun1998Efficient, Random search Bergstra2012Random, the Tree-structured Parzen Estimator approach (TPE) Bergstra2011Algorithms, and Bayesian optimization Snoek2012Practical, have shown their superiority over manual search for hyperparameter optimization of deep neural networks. Hyperparameter optimization in deep neural networks is a global optimization problem with an expensive black-box objective, where evaluating a hyperparameter choice may take several hours or even days. It is a computationally expensive problem, and a popular solution is to employ a probabilistic surrogate, such as a Gaussian process (GP) or the Tree-structured Parzen Estimator (TPE), to approximate the expensive error function and guide the optimization process. A stationary covariance function (kernel) is usually used in these surrogates, which depends only on the spatial distance between two hyperparameter configurations, not on the hyperparameters themselves. Such a covariance function, which assumes constant smoothness throughout the hyperparameter search space, clearly violates the intuition that points far from the optimal point tend to perform similarly poorly even when they are far apart from each other.

In this paper, the deep neural network for lung nodule classification is built on a multi-level convolutional neural network, which uses three levels of CNNs with the same structure but different convolutional kernel sizes to extract multi-scale features from inputs with variable nodule sizes and morphologies. The hyperparameter optimization of the deep convolutional neural network is then formulated as an expensive optimization problem, and a Gaussian surrogate model based on a non-stationary kernel is built to approximate the error function of hyperparameter configurations, which allows the model to adapt to functions whose smoothness varies with the inputs. Our algorithm searches the surrogate via a hyperparameter importance based evolutionary strategy and can find a near-optimal hyperparameter setting within a limited number of function evaluations.

We name our algorithm Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy, or HOUSES for short. We have compared our algorithm against several well-established hyperparameter optimization algorithms, including Random search, Gaussian processes with stationary kernels, and Hyperparameter Optimization via RBF and Dynamic coordinate search (HORD) Ilija. The main contributions of this paper are fourfold:

  1. A multi-level convolutional neural network is adopted for lung nodule malignancy classification, and its hyperparameter optimization is formulated as a computationally expensive optimization problem.

  2. A surrogate-assisted evolutionary strategy is introduced as the framework for solving the hyperparameter optimization of ML-CNN, which uses a hyperparameter importance based mutation as the sampling method for efficient candidate point generation.

  3. A non-stationary kernel is proposed as the covariance function that defines the relationship between different hyperparameter configurations, allowing the model's spatial dependence structure to vary as a function of location. In contrast to assuming constant smoothness throughout the whole sampling region, our non-stationary GP regression model satisfies the assumption that the correlation between configurations no longer depends on distance alone, but also on their locations relative to the optimal point. An input-warping method is also adopted, which makes the covariance function more sensitive near the hyperparameter optimum.

  4. Extensive experiments illustrate the superiority of our proposed HOUSES for hyperparameter optimization of deep neural networks.

We organise this paper as follows: Section II introduces the background on lung nodule classification, hyperparameter optimization in deep neural networks, and surrogate-assisted evolutionary algorithms. Section III describes the proposed non-stationary covariance function for hyperparameter optimization in deep neural networks, as well as the framework and details of Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy (HOUSES) for ML-CNN. The experimental design is described in Section IV, and we present the experimental results with discussion against state-of-the-art hyperparameter optimization approaches in Section V. We conclude and describe future work in Section VI.

2 Related Work

2.1 Lung nodule classification with deep neural networks

Deep neural networks have shown their superiority over conventional algorithms in computer vision, and more researchers are applying DNNs to medical imaging diagnosis. Sun et al. Sun2016Computer present different deep architectures for lung cancer diagnosis, including a stacked denoising autoencoder, a deep belief network, and a convolutional neural network, which obtain binary classification accuracies of 79.76%, 81.19%, and 79.29%, respectively. Shen et al. Shen2015Multi proposed Multi-scale Convolutional Neural Networks (MCNN), which utilized multi-scale nodule patches to sufficiently quantify nodule characteristics and obtained a binary classification accuracy of 86.84%. In MCNN, three CNNs that took nodule patches of different scales as inputs were assembled in parallel, and the outputs of their fully-connected layers were concatenated as the resulting output. The experiments showed that multi-scale inputs could help the CNN learn a set of discriminative features. In 2017, they extended this research and proposed a multi-crop CNN (MC-CNN) Shen2017Multi, which automatically extracted nodule features by adopting a multi-crop pooling strategy and obtained 87.14% binary and 62.46% ternary classification accuracy. In our previous work lyvjuan, a multi-level convolutional neural network (ML-CNN) was proposed which extracts multi-scale features through different convolutional kernel sizes. It uses three CNNs with the same structure but different convolutional kernel sizes to extract multi-scale features from nodules of variable sizes and morphologies. Our ML-CNN achieves state-of-the-art accuracies in both binary and ternary classification (92.21% and 84.81%, respectively) without any additional hand-crafted preprocessing. Even though these deep learning methods are end-to-end architectures and have shown their superiority over conventional methods, their structure design and hyperparameter configuration rely on human experts' experience, obtained through trial-and-error search guided by intuition, which is a difficult and time-consuming task Negrinho2017DeepArchitect ; Dong2018A.

2.2 Hyperparameter optimization in DNNs

Determining appropriate values for the hyperparameters of a DNN is a frustratingly difficult task: all feasible hyperparameter configurations form a huge space, from which the optimal one must be chosen. Setting the hyperparameters correctly is often critical for reaching the full potential of the deep neural network chosen or designed; otherwise the network's performance may be severely hampered.

Hyperparameter optimization in a DNN is a global optimization problem: find a d-dimensional hyperparameter setting \mathbf{x} that minimizes the validation error f of the DNN whose parameters \mathbf{w} are learned on the training set. The optimal \mathbf{x}^{*} can be obtained by optimizing f as follows:

\mathbf{x}^{*} = \arg\min_{\mathbf{x} \in \mathcal{X}} f\big(\mathbf{x}; \mathbf{w}^{*}(\mathbf{x}), \mathcal{D}_{valid}\big), \qquad \mathbf{w}^{*}(\mathbf{x}) = \arg\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}; \mathbf{x}, \mathcal{D}_{train})    (1)

where \mathcal{D}_{train} and \mathcal{D}_{valid} are the training and validation datasets, respectively. Solving Eq.(1) is very challenging because of the high complexity of the function f, and it is usually done manually in the deep learning community, relying largely on experts' experience or intuition. It is also hard to reproduce similar results when a configuration found this way is applied to different datasets or problems.

There are several systematic approaches to tuning hyperparameters in the machine learning community, such as Grid search, Random search, and Bayesian optimization methods. Grid search is the most common strategy for hyperparameter optimization Lecun1998Efficient; it is simple to implement and to parallelize, which makes it reliable in low-dimensional spaces (e.g., one or two dimensions). However, Grid search suffers from the curse of dimensionality because the search space grows exponentially with the number of hyperparameters. Random search Bergstra2012Random instead samples points randomly from the hyperparameter configuration space. Although this approach looks simple, it can find hyperparameter configurations comparable to Grid search with less computation time. Hyperparameter optimization in deep neural networks is a computationally expensive problem where evaluating a hyperparameter choice may take several hours or even days, which makes it unrealistic to sample enough points for Grid or Random search. A popular alternative is to use efficient surrogates to approximate the computationally expensive fitness function and guide the optimization process. Bayesian optimization Snoek2012Practical builds a probabilistic Gaussian process surrogate to estimate the distribution of the computationally expensive validation error. The hyperparameter configuration space is usually assumed to be smooth, meaning that knowing the quality of certain points helps infer the quality of nearby points, and Bayesian optimization Bergstra2011Algorithms ; Shahriari2015Taking ; Bergstra2012Making exploits this smoothness assumption to assist the search for hyperparameters. The Gaussian process is the most common model for the loss function in Bayesian optimization because it is simple and flexible. Several acquisition functions are available to determine the next promising point under a Gaussian process, including Probability of Improvement (PI), Expected Improvement (EI), the Upper Confidence Bound (UCB), and Predictive Entropy Search (PES) Snoek2012Practical ; Hoffman2014Predictive.

2.3 Surrogate-assisted evolutionary algorithm

Surrogate-assisted evolutionary algorithms were designed to solve expensive optimization problems whose fitness functions are highly computationally expensive Jin2011Surrogate ; Jin2009A ; Douguet2010e. They usually utilize computationally efficient models, also called surrogates, to approximate the fitness function. The surrogate model is built as:

\hat{f}(\mathbf{x}) = f(\mathbf{x}) + e(\mathbf{x})    (2)

where f(\mathbf{x}) is the true fitness value, \hat{f}(\mathbf{x}) is the approximated fitness value, and e(\mathbf{x}) is the approximation error that the selected surrogate aims to minimize. A surrogate-assisted evolutionary algorithm uses one or several surrogate models to approximate the true fitness and uses the computationally cheap surrogate to guide the search process Zhang2010Expensive. Each iteration of a surrogate-assisted evolutionary algorithm proceeds as follows: 1) learn the surrogate model from the previously truly evaluated points; 2) use the surrogate to evaluate new mutation-generated points and find the most promising individuals; 3) evaluate the true fitness values of these additional points; 4) update the training set.

Gaussian processes, polynomials, Radial Basis Functions (RBFs), neural networks, and Support Vector Machines are the main techniques used to approximate the true objective function in surrogate model learning. A non-stationary covariance function based Gaussian process is adopted as the surrogate model in this paper, which allows the model's spatial dependence structure to vary with location and satisfies our assumption that hyperparameter configurations perform well near the optimal point and poorly far away from it. An evolutionary strategy is then used to search for a near-optimal hyperparameter configuration. The next section presents the details of our Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy (HOUSES) for ML-CNN.

3 Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy

In our previous work lyvjuan, a multi-level convolutional neural network was proposed for lung nodule classification, which applies different kernel sizes in three parallel levels of CNNs to effectively extract features of lung nodules with different sizes and various morphologies. Fig. 1 presents the structure of ML-CNN, which contains three levels of CNNs, each with the same structure but a different kernel size. As suggested in our previous work, the number of feature maps in each convolutional layer has a significant impact on the performance of ML-CNN, as do the dropout rates. The hyperparameter configuration of ML-CNN in lyvjuan was obtained by trial-and-error manual search, which is time-consuming for the researcher and offers no guarantee of finding an optimal configuration. In this section, we introduce Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy (HOUSES) for our ML-CNN for lung nodule classification, which can automatically find a hyperparameter configuration competitive with, or even better than, manual search without excessive computational cost. The framework of the proposed HOUSES for ML-CNN is presented in Algorithm 1. In our hyperparameter optimization method, a non-stationary kernel is proposed as the covariance function defining the relationship between different hyperparameter configurations, which allows the model's spatial dependence structure to vary as a function of location, and the algorithm searches for the most promising hyperparameter values on the surrogate model through an evolutionary strategy. In HOUSES, several initial hyperparameter configurations are randomly generated through Latin Hypercube Sampling (LHS) Iman2008Latin to maintain the diversity of the initial population. These initial points are truly evaluated and used as the training set to build the initial surrogate model. The evolutionary strategy then generates a group of new points, which are scored according to the acquisition function of the surrogate model. The most promising individuals among the newly generated points are then truly evaluated and added to the training set to update the surrogate model. We describe HOUSES in detail in the following paragraphs.

Figure 1: The structure of the proposed ML-CNN for lung nodule malignancy classification lyvjuan .
Input: Initial population size N, maximum number of generations G, mutation rate p_m, number of new points generated per generation λ
Output: Best hyperparameter configuration x_best for the DNN model
Divide the dataset into Training, Validation and Testing sets. Initialization: a hyperparameter configuration population P is randomly generated through Latin Hypercube Sampling. These hyperparameter points are used to train the DNN model on the Training set and truly evaluated on the Validation set to obtain the true fitness values Y. while the maximum number of generations is not reached do
       1. Use (P, Y) to fit or update the Gaussian surrogate model according to Eq.(3); 2. Q = select(P) // select individuals with good performance and diversity for mutation; 3. C = mutation(Q) // apply the mutation operation to the selected points to generate new candidate points; 4. Calculate acquisition values for the candidates in C based on the Gaussian surrogate model and the acquisition functions of Section 3.3; 5. Set x_new to the most promising candidate; 6. Truly evaluate x_new on the Training and Validation sets to obtain its true fitness value y_new; 7. Update P ← P ∪ {x_new}, Y ← Y ∪ {y_new};
end while
Return the best hyperparameter configuration x_best.
Algorithm 1 General Framework of HOUSES
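To make Algorithm 1 concrete, the following is a minimal Python sketch of the HOUSES loop under illustrative settings; `fit_gp`, `select_diverse`, `importance_mutation`, `acquisition`, and `true_fitness` are hypothetical placeholders for the components described in Sections 3.1-3.4 and for the expensive DNN training and validation step.

```python
import numpy as np

def latin_hypercube(n_points, bounds, rng):
    """Simple Latin Hypercube Sampling inside box bounds (list of (low, high))."""
    d = len(bounds)
    samples = np.empty((n_points, d))
    for j, (lo, hi) in enumerate(bounds):
        edges = np.linspace(0.0, 1.0, n_points + 1)
        u = rng.uniform(edges[:-1], edges[1:])            # one sample per stratum
        samples[:, j] = lo + rng.permutation(u) * (hi - lo)
    return samples

def houses(true_fitness, bounds, n_init=20, max_gens=180, seed=0):
    """Sketch of the HOUSES loop: fit surrogate -> mutate -> acquire -> evaluate."""
    rng = np.random.default_rng(seed)
    X = latin_hypercube(n_init, bounds, rng)              # initial population
    y = np.array([true_fitness(x) for x in X])            # true validation errors
    for _ in range(max_gens):
        gp = fit_gp(X, y)                                 # surrogate with the Sec. 3.2 kernel
        parents = select_diverse(X, y)                    # grid-based selection (Sec. 3.4)
        candidates = importance_mutation(parents, rng)    # importance-based mutation (Sec. 3.4)
        scores = acquisition(gp, candidates, y.min())     # PI / EI / UCB (Sec. 3.3)
        x_new = candidates[np.argmax(scores)]             # most promising candidate
        y_new = true_fitness(x_new)                       # expensive: train DNN, measure error
        X, y = np.vstack([X, x_new]), np.append(y, y_new) # update training set
    return X[np.argmin(y)]                                # best configuration found
```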

3.1 Surrogate model building

A Gaussian process (also known as Kriging) is chosen as the surrogate model in HOUSES for searching for the most promising hyperparameters. It uses a generalization of the Gaussian distribution to describe a function, defined by a mean function m(\mathbf{x}) and a covariance function k(\mathbf{x}, \mathbf{x}'):

f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big)    (3)

Given training data \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n} consisting of d-dimensional inputs \mathbf{x}_i \in \mathbb{R}^{d} and outputs y_i \in \mathbb{R}, the predictive distribution of the Gaussian process at an unknown input \mathbf{x}_{*} is calculated as follows:

\mu(\mathbf{x}_{*}) = \mathbf{k}_{*}^{\top} (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}    (4)
\sigma^2(\mathbf{x}_{*}) = k(\mathbf{x}_{*}, \mathbf{x}_{*}) - \mathbf{k}_{*}^{\top} (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_{*}    (5)

where \mathbf{k}_{*} = [k(\mathbf{x}_{*}, \mathbf{x}_1), \ldots, k(\mathbf{x}_{*}, \mathbf{x}_n)]^{\top} and \mathbf{y} = [y_1, \ldots, y_n]^{\top}, \sigma_n^2 is a noise parameter, and \mathbf{K} is the associated covariance matrix, which is built as:

\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \qquad i, j = 1, \ldots, n    (6)

k(\mathbf{x}, \mathbf{x}') is a covariance function that defines the relationship between points in the form of a kernel. A commonly used kernel is the automatic relevance determination (ARD) squared exponential covariance function:

k_{SE}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\Big(-\tfrac{1}{2} \sum_{j=1}^{d} \frac{(x_j - x'_j)^2}{l_j^2}\Big)    (7)
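As a concrete illustration of Eqs.(4)-(7), the following sketch implements the ARD squared exponential kernel and the GP predictive mean and variance with NumPy; the length scales, signal variance, and noise level are fixed by hand here rather than learned by maximizing the marginal likelihood as a full implementation would do.

```python
import numpy as np

def ard_se_kernel(X1, X2, lengthscales, sigma_f=1.0):
    """ARD squared exponential kernel of Eq.(7): one length scale per dimension."""
    diff = X1[:, None, :] - X2[None, :, :]                       # (n1, n2, d)
    sq = np.sum((diff / lengthscales) ** 2, axis=-1)
    return sigma_f ** 2 * np.exp(-0.5 * sq)

def gp_predict(X_train, y_train, X_test, lengthscales, sigma_f=1.0, sigma_n=1e-3):
    """GP predictive mean and variance, Eqs.(4)-(5), with covariance matrix Eq.(6)."""
    K = ard_se_kernel(X_train, X_train, lengthscales, sigma_f)
    K += sigma_n ** 2 * np.eye(len(X_train))                     # add noise term
    K_s = ard_se_kernel(X_train, X_test, lengthscales, sigma_f)
    K_ss = ard_se_kernel(X_test, X_test, lengthscales, sigma_f)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha                                         # Eq.(4)
    v = np.linalg.solve(K, K_s)
    var = np.diag(K_ss) - np.sum(K_s * v, axis=0)                # Eq.(5)
    return mean, np.maximum(var, 1e-12)
```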

3.2 Non-stationary covariance function for hyperparameter optimization in DNNs

3.2.1 Spatial location transformation

In the hyperparameter optimization of DNNs, two hyperparameter points that are both far from the optimal point usually both perform poorly, even when they are far away from each other. This means that the correlation between two hyperparameter configurations depends not only on the distance between them, but also on the points' spatial locations. Stationary kernels, such as the Gaussian kernel, clearly cannot capture this property of hyperparameter optimization in DNNs. To account for this non-stationarity, we propose a non-stationary covariance function in which the relative distance to the optimal point is used to measure the spatial location difference between two hyperparameter points. The relative distance based kernel is defined as:

k_{rd}(\mathbf{x}, \mathbf{x}') = k\big(\psi(\mathbf{x}), \psi(\mathbf{x}')\big), \qquad \psi(\mathbf{x}) = |\mathbf{x} - \mathbf{x}^{*}|    (8)

where \mathbf{x}^{*} is the assumed optimal point and |\cdot| is taken element-wise. It is easy to prove that this relative distance based covariance function is a kernel using Theorem 1: Eq.(8) is obtained by taking \psi(\mathbf{x}) = |\mathbf{x} - \mathbf{x}^{*}| and choosing k as the Gaussian kernel. This relative distance based kernel is no longer a function of the distance between two points, but depends on their spatial locations relative to the optimal point.

Theorem 1

If \psi is an \mathbb{R}^{m}-valued function on \mathcal{X} and k_1 is a kernel on \mathbb{R}^{m} \times \mathbb{R}^{m}, then

k(\mathbf{x}, \mathbf{x}') = k_1\big(\psi(\mathbf{x}), \psi(\mathbf{x}')\big)    (9)

is also a kernel.

Proof: Since k_1 is a valid kernel on \mathbb{R}^{m} \times \mathbb{R}^{m}, there exists a feature map \phi with k_1(\mathbf{u}, \mathbf{v}) = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle for all \mathbf{u}, \mathbf{v} \in \mathbb{R}^{m}, so that for all \mathbf{x}, \mathbf{x}' \in \mathcal{X},

k(\mathbf{x}, \mathbf{x}') = k_1\big(\psi(\mathbf{x}), \psi(\mathbf{x}')\big) = \big\langle \phi(\psi(\mathbf{x})), \phi(\psi(\mathbf{x}')) \big\rangle,

which is an inner product under the composed feature map \phi \circ \psi, and hence k

is a kernel.
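A minimal sketch of the relative-distance construction of Eq.(8) is given below, reusing `ard_se_kernel` from the sketch above and assuming psi(x) = |x - x*| taken element-wise; in HOUSES the assumed optimum x* is the best configuration found so far.

```python
import numpy as np

def relative_distance_kernel(X1, X2, x_star, lengthscales, sigma_f=1.0):
    """Non-stationary kernel of Eq.(8): points are compared through their
    element-wise distance to the assumed optimum x_star (Theorem 1 with
    psi(x) = |x - x_star| and a squared exponential base kernel)."""
    psi1 = np.abs(X1 - x_star)      # spatial location transformation psi(x)
    psi2 = np.abs(X2 - x_star)
    return ard_se_kernel(psi1, psi2, lengthscales, sigma_f)
```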

3.2.2 Input Warping

In the hyperparameter optimization of machine learning models, objective functions are usually much more sensitive near the optimal hyperparameter setting than far away from the optimum. For example, if the optimal learning rate is 0.05, changing the learning rate from 0.04 to 0.05 may yield a 50% performance improvement, while changing it from 0.25 to 0.24 may yield only a 5% improvement. Traditionally, researchers often apply a logarithmic transformation to the input space and then search in the transformed space, which is effective only when the non-stationarity of the input space is known in advance. Recently, the beta cumulative distribution function was proposed as an input warping transformation snoek2014input ; swersky2017improving,

w_j(x_j) = \int_{0}^{x_j} \frac{u^{\alpha_j - 1}(1 - u)^{\beta_j - 1}}{B(\alpha_j, \beta_j)}\, du    (10)

where B(\alpha_j, \beta_j) is the beta function, and the shape of the warping function is adjusted to the original data through the parameters \alpha_j and \beta_j.

Figure 2: Example of how the Kumaraswamy cumulative distribution function transforms a concave function into a convex function, which makes the kernel function much more sensitive to small inputs.

Different from snoek2014input ; swersky2017improving, we only warp the relative distance to the local optimum, which makes the kernel function more sensitive to small inputs and less sensitive to large ones. We use the Kumaraswamy cumulative distribution function as a substitute, not only for computational reasons, but also because it more easily provides the non-stationary behavior of our kernel function after the spatial location transformation,

w(z) = 1 - (1 - z^{a})^{b}    (11)

where a and b are shape parameters. Similar to Eq.(9), it is easy to prove that the warped relative-distance kernel k\big(w(\psi(\mathbf{x})), w(\psi(\mathbf{x}'))\big) is a kernel. Fig. 2 illustrates input warping with different shape parameters a and b. The final kernel for our HOUSES is defined as the product of the stationary kernel and the warped relative-distance kernel:

k_{HOUSES}(\mathbf{x}, \mathbf{x}') = k_{SE}(\mathbf{x}, \mathbf{x}') \cdot k\big(w(\psi(\mathbf{x})), w(\psi(\mathbf{x}'))\big)    (12)

Eq.(12) is also a kernel, as proved by Theorem 2. This non-stationary kernel satisfies the assumption that the correlation between two hyperparameter configurations depends not only on their distance, but also on their locations relative to the optimal point. Since the optimal point cannot be known in advance, we instead use the hyperparameter configuration with the best performance in the training set as \mathbf{x}^{*}, and update it in every iteration of the proposed HOUSES.
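The sketch below combines Eqs.(11)-(12) under the assumption that inputs are scaled to [0, 1]: the Kumaraswamy CDF warps the element-wise distance to the incumbent, and the final covariance is taken as the product of the stationary ARD kernel and the warped relative-distance kernel, which is a kernel by Theorem 2. The shape parameters a and b are illustrative, and `ard_se_kernel` is reused from the sketch in Section 3.1.

```python
import numpy as np

def kumaraswamy_cdf(z, a=0.5, b=1.0):
    """Kumaraswamy warping of Eq.(11); inputs z are assumed to lie in [0, 1]."""
    return 1.0 - (1.0 - np.clip(z, 0.0, 1.0) ** a) ** b

def houses_kernel(X1, X2, x_star, lengthscales, a=0.5, b=1.0, sigma_f=1.0):
    """One reading of Eq.(12): stationary ARD kernel times the warped
    relative-distance kernel, so the correlation depends both on mutual
    distance and on location relative to the incumbent x_star."""
    w1 = kumaraswamy_cdf(np.abs(X1 - x_star), a, b)   # warp psi(x) = |x - x*|
    w2 = kumaraswamy_cdf(np.abs(X2 - x_star), a, b)
    k_stationary = ard_se_kernel(X1, X2, lengthscales, sigma_f)
    k_warped = ard_se_kernel(w1, w2, lengthscales, sigma_f)
    return k_stationary * k_warped
```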

Theorem 2

If k_1 is a kernel on \mathcal{X} \times \mathcal{X} and k_2 is also a kernel on \mathcal{X} \times \mathcal{X}, then

k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}')    (13)

is also a kernel.

Proof: Since k_1 and k_2 are valid kernels on \mathcal{X} \times \mathcal{X}, there exist feature maps \phi_1 and \phi_2 with k_1(\mathbf{x}, \mathbf{x}') = \langle \phi_1(\mathbf{x}), \phi_1(\mathbf{x}') \rangle and k_2(\mathbf{x}, \mathbf{x}') = \langle \phi_2(\mathbf{x}), \phi_2(\mathbf{x}') \rangle. We may define \phi(\mathbf{x}) = \phi_1(\mathbf{x}) \otimes \phi_2(\mathbf{x}), the tensor product of the two feature maps, so that

k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}'),

and hence k is a kernel.

3.3 Acquisition function

After building the surrogate model, an acquisition function is required to choose the most promising point for true evaluation. Whereas the surrogate model approximates the optimization problem, the acquisition function is used to decide where the most likely optimal solution lies.

We applied three different acquisition functions for Gaussian process (GP) based hyperparameter optimization (a computational sketch follows the list):

  • Probability of Improvement

    PI(\mathbf{x}) = \Phi\big(\gamma(\mathbf{x})\big), \qquad \gamma(\mathbf{x}) = \frac{f(\mathbf{x}_{best}) - \mu(\mathbf{x})}{\sigma(\mathbf{x})}    (14)

    where \Phi is the standard normal cumulative distribution function and \mathbf{x}_{best} is the best hyperparameter configuration evaluated so far.

  • Expected Improvement

    EI(\mathbf{x}) = \sigma(\mathbf{x})\big(\gamma(\mathbf{x})\,\Phi(\gamma(\mathbf{x})) + \phi(\gamma(\mathbf{x}))\big)    (15)

    where \phi is the standard normal probability density function, and the predictive value at \mathbf{x}

    has a Gaussian distribution with mean \mu(\mathbf{x}) and variance

    \sigma^2(\mathbf{x}).

  • and Upper Confidence Bound

    UCB(\mathbf{x}) = -\mu(\mathbf{x}) + w\,\sigma(\mathbf{x})    (16)

    with a tunable w to balance exploitation against exploration Ilija .
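The three acquisition functions of Eqs.(14)-(16) can be computed directly from the GP posterior; a short sketch for minimizing the validation error is given below, where y_best is the best error observed so far and w is the UCB trade-off weight (an assumed parameterization for illustration, not the exact settings used in the experiments).

```python
import numpy as np
from scipy.stats import norm

def acquisition_value(mean, std, y_best, kind="EI", w=2.0):
    """PI, EI and UCB of Eqs.(14)-(16) for a minimization problem.
    mean, std: GP predictive mean and standard deviation at the candidate points."""
    std = np.maximum(std, 1e-12)
    gamma = (y_best - mean) / std                       # standardized improvement
    if kind == "PI":
        return norm.cdf(gamma)                          # Eq.(14)
    if kind == "EI":
        return std * (gamma * norm.cdf(gamma) + norm.pdf(gamma))   # Eq.(15)
    if kind == "UCB":
        return -mean + w * std                          # Eq.(16): low mean, high uncertainty
    raise ValueError("unknown acquisition: " + kind)
```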

3.4 Hyperparameter importance based mutation for candidate hyperparameter point generation

Mutation aims to generate better individuals by mutating selected excellent individuals, and is a key step of optimization in an evolutionary strategy. To maintain the diversity of the population, a uniform selection strategy is adopted before mutation. It first divides every dimension into uniform grids Zhang2018A, and the point with the highest fitness in each grid of each dimension is selected for mutation. In this way, a set of individuals is selected, and polynomial mutation is applied to each selected individual to generate candidate hyperparameter points. These points are evaluated with the acquisition function, and the most promising point is selected for true evaluation and added to the training set to update the surrogate model.

Figure 3: Functional ANOVA based marginal response performance of the number of feature maps of all convolutional layers in the three levels of ML-CNN. The first two parameters are for the two convolutional layers of the first level, the middle two for the second level, and the last two for the third level. The results show that the latter layer in each level has a greater effect on performance, while the number of feature maps in the earlier layer of each level shows no significant difference across configurations.

However, as suggested by several recent works on Bayesian hyperparameter optimization hoos2014efficient ; Bergstra2012Random, most hyperparameters are truly unimportant while a few are much more important than others. Fig. 3 shows how the marginal performance of ML-CNN varies with the number of feature maps, and clearly indicates that the number of feature maps in the last convolutional layer of each level is much more crucial to ML-CNN than in the earlier ones. Hutter et al. hoos2014efficient proposed using functional analysis of variance (functional ANOVA) to measure the importance of hyperparameters in machine learning problems. Functional ANOVA is a statistical method that partitions the observed variation of a response value (CNN performance) into components attributed to each of its inputs (hyperparameter settings), and thus illustrates how the response performance changes with the input hyperparameters. It first decomposes the response function into contributions from all subsets of its inputs N = \{1, \ldots, d\}:

f(\boldsymbol{\theta}) = \sum_{U \subseteq N} f_U(\boldsymbol{\theta}_U)    (17)

where each component f_U is defined as:

f_U(\boldsymbol{\theta}_U) = \begin{cases} f_{\emptyset} & \text{if } U = \emptyset \\ a_U(\boldsymbol{\theta}_U) - \sum_{W \subsetneq U} f_W(\boldsymbol{\theta}_W) & \text{otherwise} \end{cases}    (18)

where the constant f_{\emptyset} is the mean value of the function over its domain, and a_U(\boldsymbol{\theta}_U) is the marginal predicted performance obtained by averaging f over the instantiations of all hyperparameters not in U. A component f_U with |U| > 1 captures the interaction between all hyperparameters in U; in this paper we only consider the importance of individual hyperparameters and set |U| = 1. The component function is then calculated as:

f_{\{j\}}(\theta_j) = a_{\{j\}}(\theta_j) - f_{\emptyset}    (19)

where \theta_j is a single hyperparameter. The variance of the response performance of f across its domain \Theta is

V = \frac{1}{|\Theta|} \int_{\Theta} \big(f(\boldsymbol{\theta}) - f_{\emptyset}\big)^2 \, d\boldsymbol{\theta}    (20)

The importance of each hyperparameter \theta_j can thus be quantified as the fraction of this variance explained by its component:

F_j = \frac{V_j}{V}, \qquad V_j = \frac{1}{|\Theta_j|} \int_{\Theta_j} f_{\{j\}}(\theta_j)^2 \, d\theta_j    (21)

When the polynomial mutation operator is applied to individuals, genes corresponding to different hyperparameters are given different mutation probabilities according to their hyperparameter importances, so that genes with larger importances have higher mutation probabilities and generate more offspring (a sketch is given below). In this way, our evolutionary strategy places more emphasis on the subspaces of important hyperparameters and is expected to find better hyperparameter settings.
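A sketch of the importance-based polynomial mutation described above is given here: each gene mutates with a probability proportional to its functional ANOVA importance score. The importance vector, the bounds, and the distribution index eta are assumed to be supplied, and the simplified polynomial mutation step is illustrative.

```python
import numpy as np

def importance_polynomial_mutation(x, importances, bounds, rng, eta=20.0):
    """Polynomial mutation where more important hyperparameters mutate more often."""
    probs = importances / importances.sum()          # per-gene mutation probability
    child = x.astype(float).copy()
    for j, (lo, hi) in enumerate(bounds):
        if rng.random() < probs[j]:
            u = rng.random()
            if u < 0.5:                              # simplified polynomial mutation step
                delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
            else:
                delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
            child[j] = np.clip(x[j] + delta * (hi - lo), lo, hi)
    return child
```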

4 Experimental Design

To examine the optimization performance of the proposed HOUSES for hyperparameter optimization, two sets of experiments were conducted. In the first set, we test HOUSES on a Multi-layer Perceptron (MLP) network and LeNet applied to the popular MNIST dataset, and on AlexNet applied to the CIFAR-10 dataset. The second set contains a single experiment whose target is to find an optimal hyperparameter configuration of ML-CNN applied to lung nodule classification. All experiments were performed on an Nvidia Quadro P5000 GPU (16.0 GB memory, 8873 GFLOPS). Our experiments are implemented in a Python 3.6 environment, and TensorFlow (https://github.com/tensorflow/tensorflow) and TensorLayer (https://github.com/tensorlayer/tensorlayer) are used to build the deep neural networks.

The following subsections give a brief introduction to the experimental problems, the peer algorithms, and the evaluation budget and experimental settings.

4.1 DNN problems

The first DNN problem in the first experimental set is an MLP applied to MNIST, which consists of three dense layers with ReLU activations, dropout layers between them, and a softmax output at the end. The hyperparameters we optimize with HOUSES and the peer algorithms include the dropout rate of each dropout layer and the number of units in each dense layer. This problem has 5 parameters to be optimized and is referred to as 5-MLP in this paper. The second DNN problem is LeNet-5 applied to MNIST, with 7 hyperparameters to be optimized, referred to as 7-CNN. The 7-CNN contains two convolutional blocks, each consisting of a convolutional layer with batch normalization followed by an activation and max-pooling, and ends with three fully-connected layers with two dropout layers between them. The parameters optimized in 7-CNN are the number of feature maps in every convolutional layer, the number of units in the first two fully-connected layers, and the dropout rates of all dropout layers. The third DNN problem in the first set is to optimize the hyperparameters of AlexNet applied to the CIFAR-10 dataset. There are 9 parameters: the numbers of feature maps in the 5 convolutional layers, the numbers of units in the two fully-connected layers, and the dropout rate of the dropout layer after them. This is referred to as the 9-CNN problem in this paper.
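To make the 5-MLP search space concrete, the sketch below builds the network from a five-dimensional hyperparameter vector (three layer widths and two dropout rates) with tf.keras rather than the TensorLayer code used in the experiments; the encoding of the vector and the fixed choices (optimizer, input shape) are assumptions for illustration.

```python
import tensorflow as tf

def build_5mlp(hp):
    """hp = [units1, units2, units3, drop1, drop2]: a hypothetical encoding of
    the five hyperparameters of the 5-MLP problem (MLP on MNIST)."""
    units1, units2, units3, drop1, drop2 = hp
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(int(units1), activation="relu"),
        tf.keras.layers.Dropout(drop1),
        tf.keras.layers.Dense(int(units2), activation="relu"),
        tf.keras.layers.Dropout(drop2),
        tf.keras.layers.Dense(int(units3), activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),   # 10 MNIST classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```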

In the second experimental set, we evaluate HOUSES on ML-CNN applied to lung nodule classification. There are again 9 hyperparameters to be optimized, consisting of the number of feature maps in every convolutional layer, the number of units in the fully-connected layer, and the dropout rate of every dropout layer. This hyperparameter optimization problem is denoted 9-ML-CNN in this paper. The lung nodule images in this experiment are from the Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI) database 3Rd2012The ; Reeves2007The, which contains 1,018 cases from 1,010 patients annotated by four radiologists. The malignancy suspiciousness of each nodule is rated from 1 to 5 by the four radiologists, where levels 1 and 2 are benign nodules, level 3 is an indeterminate nodule, and levels 4 and 5 are malignant nodules. The diagnosis of a nodule is labeled with the class of highest frequency among the ratings, or as indeterminate when more than one class ties for the highest frequency (see the sketch below). The nodules in the images are cropped according to the contour annotations of the four radiologists and resized to serve as the input of our multi-level convolutional neural network.
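A small sketch of the labeling rule described above, mapping the four radiologists' malignancy ratings (1-5) to benign / indeterminate / malignant by the most frequent class, with ties resolved to indeterminate; the function name and representation are hypothetical.

```python
from collections import Counter

def nodule_label(ratings):
    """ratings: the four malignancy scores (1..5) given by the four radiologists."""
    def to_class(r):
        if r <= 2:
            return "benign"
        if r == 3:
            return "indeterminate"
        return "malignant"
    counts = Counter(to_class(r) for r in ratings).most_common()
    # more than one class ties for the highest frequency -> indeterminate
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "indeterminate"
    return counts[0][0]
```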

4.2 Peer algorithms

We compare HOUSES against Random search, Gaussian processes (GP) with a Gaussian kernel, and Hyperparameter Optimization via RBF and Dynamic coordinate search (HORD). We also compare three different acquisition functions for the Gaussian process based hyperparameter optimization: Gaussian processes with Expected Improvement (GP-EI), with Probability of Improvement (GP-PI), and with the Upper Confidence Bound (GP-UCB).

4.3 Evaluation budget and experimental setting

Evaluating a hyperparameter configuration is typically computationally expensive and accounts for most of the computation cost in the DNN hyperparameter optimization problem. For a fair comparison, we set the number of function evaluations to 200 for all comparing algorithms. The number of training iterations is set to 100 for the MNIST dataset, and to 200 and 500 for CIFAR-10 and LIDC-IDRI, respectively.

We implement Random search with the open-source HyperOpt library (http://hyperopt.github.io/hyperopt/). We use the public scikit-learn library (https://scikit-learn.org/stable/modules/gaussian_process.html) to build the Gaussian process based surrogate models. The implementation of HORD is available at bit.ly/hord-aaai, and the code for assessing hyperparameter importance with functional ANOVA is available at https://github.com/automl/fanova.

5 Experimental results and discussion

DNN Problems 5-MLP 7-CNN 9-CNN 9-ML-CNN
Random Search 0.9731 0.9947 0.7429 0.8401
HORD 0.9684 0.9929 0.7471 0.8421
GP-EI 0.9647 0.9934 0.7546 0.8517
GP-PI 0.9645 0.9937 0.7650 0.8473
GP-UCB 0.9637 0.9942 0.7318 0.8457
HOUSES-EI 0.9698 0.9931 0.7642 0.8511
HOUSES-PI 0.9690 0.9949 0.7683 0.8541
HOUSES-UCB 0.9738 0.9937 0.7493 0.8576
Manual Tuning - - - 0.8481
Table 1: Experimental mean accuracy of comparing algorithms on 4 DNN problems
Algorithm 5-MLP 7-CNN 9-CNN
Sensitivity Specificity AUC Sensitivity Specificity AUC Sensitivity Specificity AUC
Random Search 0.95054 0.99521 0.97588 0.98809 0.99569 0.99339 0.7590 0.97130 0.8617
HORD 0.95085 0.99458 0.97272 0.98398 0.99832 0.99111 0.76690 0.97410 0.87050
GP-EI 0.95474 0.99502 0.97488 0.95576 0.99840 0.99182 0.76800 0.97420 0.87110
GP-PI 0.93838 0.99324 0.96581 0.98706 0.99857 0.99281 0.7571 0.9730 0.86505
GP-UCB 0.93604 0.99298 0.96451 0.98511 0.99842 0.99207 0.7609 0.97343 0.86717
HOUSES-EI 0.93642 0.99300 0.96472 0.98414 0.99855 0.99519 0.76940 0.97438 0.87200
HOUSES-PI 0.94486 0.99395 0.96940 0.98517 0.99857 0.99377 0.7798 0.97553 0.87767
HOUSES-UCB 0.96161 0.99578 0.97870 0.98578 0.99852 0.99355 0.7609 0.9343 0.86717
Table 2: Comparison results in three DNN problems

5.1 Experiments on MNIST and CIFAR-10

In this section, we evaluate the peer hyperparameter optimization algorithms on 3 DNN problems: an MLP applied to MNIST (5-MLP), LeNet applied to MNIST (7-CNN), and AlexNet applied to CIFAR-10 (9-CNN).

For the 5-MLP problem, Table 1 (Column 2) shows the test results obtained by the different comparing methods, and Fig. 8(a) plots the average accuracy over epochs of the hyperparameter configurations obtained by the different hyperparameter optimization methods. One surprising observation from the table and figure is that the simplest method, Random Search, achieves satisfactory results and sometimes even outperforms some Bayesian optimization based methods (GPs and HORD). This phenomenon suggests that, for low-dimensional hyperparameter optimization, simple Random Search can perform very well, which is in line with Bergstra2012Random. Furthermore, Table 1 (Column 2) and Fig. 8(a) show that, under the same experimental settings, our proposed non-stationary kernel clearly performs better than the stationary Gaussian kernel with all three acquisition functions on the 5-MLP problem. This also demonstrates that incorporating priors based on expert intuition into Bayesian optimization and designing a non-stationary kernel are worthwhile for Gaussian process based hyperparameter optimization.

On the 7-CNN problem, most hyperparameter optimization algorithms obtain satisfactory results, with test errors lower than the best result on the 5-MLP problem (see Column 3 of Table 1). These results demonstrate that a better neural network structure can significantly improve performance and is more robust to the hyperparameter configuration; indeed, there is no significant difference between the hyperparameter optimization methods on the 7-CNN problem. Designing an appropriate neural network structure is therefore of primary importance, which is also the reason we designed a multi-level convolutional neural network for lung nodule classification.

On the more complicated and harder 9-CNN problem, the GPs find significantly better hyperparameters than the Random Search algorithm, except GP-UCB, which may be due to an improper weight setting in the UCB acquisition function (see Column 4 of Table 1 and Fig. 8(c)). These results also show that Random Search performs considerably worse than the other hyperparameter optimization algorithms on the 9-CNN problem, and suggest that hyperparameter optimization is required for complicated DNN problems to help the deep neural network reach its full potential. Similar to the results on the 5-MLP problem, the results on 9-CNN again show the superiority of our proposed non-stationary kernel for hyperparameter optimization in CNNs, where the non-stationary kernel consistently outperforms the standard Gaussian kernel. Fig. 8(c) shows that the performances of HOUSES and GP with different acquisition functions are clearly distinguishable, with HOUSES-PI and GP-PI obtaining better results than the other two acquisition functions. In addition, the results show the importance of a suitable acquisition function: GPs with the UCB acquisition function obtain worse results than Random Search on 9-CNN, while achieving the best results among all methods on the 5-MLP problem.

Table 2 summarizes the sensitivity and specificity (accuracy is given in Table 1) of the hyperparameter configurations obtained by all comparing algorithms for 5-MLP, 7-CNN, and 9-CNN. The two indicators are defined as:

\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}    (22)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

We also calculated the Area Under the Curve (AUC) Chien2003Pattern of the Receiver Operating Characteristic (ROC) curve as an assessment criterion in Table 2. As demonstrated in Table 2, our HOUSES approach outperforms Random Search, HORD, and the standard kernel based Gaussian processes in accuracy, sensitivity, and specificity on the 5-MLP and 9-CNN problems. On the 5-MLP problem, Random Search also achieves remarkable results, which again suggests that simple Random Search can perform very well in low-dimensional hyperparameter optimization. Although the classification rate increases by only 1-2 percentage points, this is a significant improvement for hyperparameter optimization. There are no substantial statistical differences between the comparing algorithms on 7-CNN, which again demonstrates that a better neural network structure can significantly improve performance and relieve the burden of hyperparameter optimization.
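The indicators of Eq.(22) and the AUC reported in Table 2 can be computed from predictions and scores, for example with scikit-learn; the sketch below handles the binary case with 0/1 labels, whereas the multi-class MNIST and CIFAR-10 results would additionally require averaging over classes.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_pred, y_score):
    """Sensitivity and specificity of Eq.(22), plus ROC AUC, for a binary problem."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    auc = roc_auc_score(y_true, y_score)
    return sensitivity, specificity, auc
```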

Figure 8: Testing accuracy of all hyperparameter optimization algorithms on the four DNN problems: (a) 5-MLP, (b) 7-CNN, (c) 9-CNN, (d) 9-ML-CNN.

5.2 Experiments on LIDC-IDRI (lung nodule classification problem)

In this section, we evaluate HOUSES and all comparing algorithms on 9-ML-CNN (the multi-level convolutional neural network applied to lung nodule classification, with 9 hyperparameters to be optimized); the results are shown in Table 1 (Column 5) and Fig. 8(d). As expected, the performance of conventional hyperparameter optimization methods degrades significantly in this complicated, high-dimensional search space, while HOUSES continues to obtain satisfactory results and outperforms the Gaussian process with stationary kernels. Similar to the results in the previous subsection, we find that UCB obtains the best result among the three acquisition functions, which suggests that UCB may be the most appropriate acquisition function for 9-ML-CNN hyperparameter optimization.

Fig. 8(d) presents the test accuracy over iterations of the hyperparameter configurations obtained for the 9-ML-CNN problem by the different hyperparameter optimization methods. It is apparent that our spatial location based non-stationary kernel outperforms the stationary Gaussian kernel under all three acquisition functions, which indicates that a non-stationary kernel is especially necessary for complicated CNN hyperparameter optimization. We also observe that HOUSES-UCB reaches better validation accuracy in only 250 epochs than the manual tuning method lyvjuan. Moreover, Table 3 presents the ability of ML-CNN, with the hyperparameter configurations obtained by the different hyperparameter optimization methods, to classify the different types of nodules, showing the sensitivity, specificity, and AUC on the three nodule classes.

The results on the 9-ML-CNN problem again show that using a non-stationary kernel that takes spatial location into consideration significantly improves the convergence of hyperparameter optimization, especially for high-dimensional and complicated deep neural networks, and that our hyperparameter optimization method HOUSES not only relieves the tedious work of tuning hyperparameters but also obtains better accuracy than manual tuning lyvjuan. The experimental results above show that the non-stationary assumption is non-trivial for Bayesian hyperparameter optimization in DNNs, and that incorporating priors based on expert intuition into the Bayesian optimization framework is expected to improve optimization effectiveness.

Algorithm Benign Indeterminate Malignant
Sensitivity Specificity AUC Sensitivity Specificity AUC Sensitivity Specificity AUC
Random Search 0.79260 0.92026 0.86643 0.83593 0.85256 0.85925 0.81320 0.92327 0.87814
HORD 0.83457 0.92735 0.88096 0.83447 0.90906 0.87226 0.86095 0.92947 0.89521
GP-EI 0.83395 0.92615 0.88005 0.83790 0.91208 0.87499 0.85685 0.92640 0.89160
GP-PI 0.81913 0.93726 0.89239 0.85314 0.91601 0.87604 0.85385 0.91874 0.88955
GP-UCB 0.82407 0.93155 0.88707 0.85314 0.91994 0.87740 0.85385 0.92364 0.89792
HOUSES-EI 0.84259 0.94116 0.88262 0.83406 0.89758 0.87536 0.87219 0.92702 0.89043
HOUSES-PI 0.84753 0.93966 0.87940 0.83608 0.89819 0.87109 0.86035 0.92180 0.88871
HOUSES-UCB 0.85000 0.93546 0.89273 0.82328 0.91516 0.86919 0.87745 0.92000 0.89371
Manual Tuning 0.80617 0.93455 0.87036 0.87751 0.85468 0.86610 0.79882 0.95216 0.87549
Table 3: Comparison results of 9-ML-CNN problem for each class

6 Conclusion

In this paper, a Hyperparameter Optimization with sUrrogate-aSsisted Evolutionary Strategy, named HOUSES, is proposed for CNN hyperparameter optimization. A non-stationary kernel is devised and adopted as the covariance function defining the relationship between different hyperparameter configurations in the Gaussian process model, which allows the model's spatial dependence structure to vary as a function of location. Our previously proposed multi-level convolutional neural network (ML-CNN) is developed for lung nodule malignancy classification, and its hyperparameter configuration is optimized by HOUSES. Experimental results on several deep neural networks and datasets validate that our non-stationary kernel based approach finds better hyperparameter configurations than other approaches, such as Random search, the Tree-structured Parzen Estimator (TPE), Hyperparameter Optimization via RBF and Dynamic coordinate search (HORD), and stationary kernel based Gaussian process Bayesian optimization. The experimental results suggest that, even though Random Search is a simple and effective way of performing CNN hyperparameter optimization, it struggles to find satisfactory configurations for high-dimensional and complex deep neural networks, and that incorporating priors based on expert intuition into the conventional Bayesian optimization framework is expected to improve optimization effectiveness. Furthermore, the results also demonstrate that devising a suitable network structure is the more robust way to improve performance, while hyperparameter optimization helps the network reach its full potential.

In light of these promising initial results, our future research will focus on extending HOUSES to deep neural network architecture search. Several works have been proposed to automatically search for well-performing CNN architectures via hill climbing Elsken2017Simple, Q-learning Zhong2018Practical, sequential model-based optimization (SMBO) Negrinho2018DeepArchitect, and genetic programming Suganuma2017A. However, few works utilize a surrogate model to reduce the expensive evaluation cost of CNN architecture search. Moreover, a simple evolutionary strategy is not an appropriate method for searching the surrogate for an optimal architecture design, which is a variable-length optimization problem Lehman2011Evolving, and a quality-diversity based evolutionary algorithm may provide a solution.

References

  • (1) S. G. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, and E. A. Hoffman. The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics, 38(2):915, 2012.
  • (2) M Anthimopoulos, S Christodoulidis, L Ebner, A Christe, and S Mougiakakou. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1207–1216, 2016.
  • (3) J Bergstra, D Yamins, and D. D Cox. Making a science of model search. 2012.
  • (4) James Bergstra and Yoshua Bengio. Algorithms for hyper-parameter optimization. In International Conference on Neural Information Processing Systems, pages 2546–2554, 2011.
  • (5) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281–305, 2012.
  • (6) Konstantinos Chatzilygeroudis, Roberto Rama, Rituraj Kaushik, Dorian Goepp, Vassilis Vassiliades, and Jean Baptiste Mouret. Black-box data-efficient policy search for robotics. In Ieee/rsj International Conference on Intelligent Robots and Systems, 2017.
  • (7) Y. Chien. Pattern classification and scene analysis. IEEE Transactions on Automatic Control, 19(4):462–463, 2003.
  • (8) Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, and Michael Pringle. The cancer imaging archive (tcia): Maintaining and operating a public information repository. Journal of Digital Imaging, 26(6):1045–1057, 2013.
  • (9) Hongbin Dong, Tao Li, Rui Ding, and Jing Sun. A novel hybrid genetic algorithm with granular information for feature selection and optimization. Applied Soft Computing, 65, 2018.
  • (10) Dominique Douguet. e-lea3d: a computational-aided drug design web server. Nucleic Acids Research, 38(Web Server issue):615–21, 2010.
  • (11) Thomas Elsken, Jan-Hendrik Metzen, and Frank Hutter. Simple and efficient architecture search for convolutional neural networks. arXiv preprint arXiv:1711.04528, 2017.
  • (12) Marc G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2(2):299–312, 2002.
  • (13) R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, June 2014.
  • (14) Rotem Golan, Christian Jacob, and Jörg Denzinger. Lung nodule detection in ct images using deep convolutional neural networks. In International Joint Conference on Neural Networks, pages 243–250, 2016.
  • (15) Matthew W. Hoffman and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In International Conference on Neural Information Processing Systems, pages 918–926, 2014.
  • (16) Holger Hoos and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762, 2014.
  • (17) Ilija Ilievski, Taimoor Akhtar, Jiashi Feng, and Christine Shoemaker. Efficient hyperparameter optimization for deep learning algorithms using deterministic rbf surrogates, 2017.
  • (18) Ronald L. Iman. Latin Hypercube Sampling. John Wiley & Sons, Ltd, 2008.
  • (19) Yaochu Jin. Surrogate-assisted evolutionary computation: Recent advances and future challenges. Swarm and Evolutionary Computation, 1(2):61–70, 2011.
  • (20) Yaochu Jin and B Sendhoff. A systems approach to evolutionary multiobjective structural optimization and beyond. Computational Intelligence Magazine IEEE, 4(3):62–76, 2009.
  • (21) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, pages 1097–1105, 2012.
  • (22) Yann Lecun, Leon Bottou, Genevieve B. Orr, and Klaus Robert Müller. Efficient backprop. Neural Networks Tricks of the Trade, 1524(1):9–50, 1998.
  • (23) Joel Lehman and Kenneth O Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pages 211–218. ACM, 2011.
  • (24) Kui Liu and Guixia Kang. Multiview convolutional neural networks for lung nodule classification. Plos One, 12(11):12–22, 2017.
  • (25) Ilya Loshchilov and Frank Hutter. CMA-ES for hyperparameter optimization of deep neural networks. CoRR, abs/1604.07269, 2016.
  • (26) Juan Lyu and Sai Ho Ling. Using multi-level convolutional neural network for classification of lung nodules on ct images. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 686–689. IEEE, 2018.
  • (27) Alan Tan Wei Min, Yew Soon Ong, Abhishek Gupta, and Chi Keong Goh. Multi-problem surrogates: Transfer evolutionary multiobjective optimization of computationally expensive problems. IEEE Transactions on Evolutionary Computation, PP(99):1–1, 2017.
  • (28) Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architectures. 2017.
  • (29) Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv preprint arXiv:1704.08792, 2017.
  • (30) Anthony P. Reeves and Alberto M. Biancardi. The lung image database consortium (lidc) nodule size report. http://www.via.cornell.edu/lidc/, 2011.
  • (31) Anthony P. Reeves, Alberto M. Biancardi, Tatiyana V. Apanasovich, Charles R. Meyer, Heber Macmahon, Edwin J. R. Van Beek, Ella A. Kazerooni, David Yankelevitz, Michael F. Mcnittgray, and Geoffrey Mclennan. The lung image database consortium (lidc): pulmonary nodule measurements, the variation, and the difference between different size metrics. In Medical Imaging 2007: Computer-Aided Diagnosis, pages 1475–1485, 2007.
  • (32) Rommel G. Regis and Christine A. Shoemaker. Combining radial basis function surrogates and dynamic coordinate search in high-dimensional expensive black-box optimization. Engineering Optimization, 45(5):529–555, 2013.
  • (33) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • (34) W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian. Multi-scale convolutional neural networks for lung nodule classification. Inf Process Med Imaging, 24:588–599, 2015.
  • (35) Wei Shen, Mu Zhou, Feng Yang, Dongdong Yu, Di Dong, Caiyun Yang, Yali Zang, and Jie Tian. Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognition, 61(61):663–673, 2017.
  • (36) R. L. Siegel, K. D. Miller, S. A. Fedewa, D. J. Ahnen, R. G. Meester, A Barzi, and A Jemal. Colorectal cancer statistics, 2017. Ca Cancer J Clin, 67(3):104–17, 2017.
  • (37) Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimization of machine learning algorithms. In International Conference on Neural Information Processing Systems, pages 2951–2959, 2012.
  • (38) Jasper Snoek, Kevin Swersky, Rich Zemel, and Ryan Adams. Input warping for bayesian optimization of non-stationary functions. In International Conference on Machine Learning, pages 1674–1682, 2014.
  • (39) Q. Song, L. Zhao, X. Luo, and X. Dou. Using deep learning for classification of lung nodules on computed tomography images. J Healthc Eng., 2017(1):1–7, 2017.
  • (40) Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. pages 497–504, 2017.
  • (41) Wenqing Sun, Bin Zheng, and Wei Qian. Computer aided lung cancer diagnosis with deep learning algorithms. In Medical Imaging 2016: Computer-Aided Diagnosis, 2016.
  • (42) Kevin Jordan Swersky. Improving Bayesian Optimization for Machine Learning using Expert Priors. PhD thesis, 2017.
  • (43) Naiyan Wang, Siyi Li, Abhinav Gupta, and DitYan Yeung. Transferring rich feature hierarchies for robust visual tracking. Computer Science, 2015.
  • (44) Miao Zhang and Huiqi Li. A reference direction and entropy based evolutionary algorithm for many-objective optimization. Applied Soft Computing, 70:108–130, 2018.
  • (45) Qingfu Zhang, Wudong Liu, Edward Tsang, and Botond Virginas. Expensive multiobjective optimization by moea/d with gaussian process model. IEEE Transactions on Evolutionary Computation, 14(3):456–474, 2010.
  • (46) Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. Practical block-wise neural network architecture generation. pages 2423–2432, 2018.