HYPPO: A Surrogate-Based Multi-Level Parallelism Tool for Hyperparameter Optimization

10/04/2021, by Vincent Dumont et al.

We present a new software tool, HYPPO, that enables the automatic tuning of hyperparameters of various deep learning (DL) models. Unlike other hyperparameter optimization (HPO) methods, HYPPO uses adaptive surrogate models and directly accounts for uncertainty in model predictions to find accurate and reliable models that make robust predictions. Using asynchronous nested parallelism, we are able to significantly alleviate the computational burden of training complex architectures and quantifying the uncertainty. HYPPO is implemented in Python and can be used with both TensorFlow and PyTorch libraries. We demonstrate various software features on time-series prediction and image classification problems as well as a scientific application in computed tomography image reconstruction. Finally, we show that (1) we can reduce by an order of magnitude the number of evaluations necessary to find the optimal region in the hyperparameter space and (2) we can reduce the time-to-solution of the HPO process by up to two orders of magnitude.


I Introduction

Deep learning models are increasingly used throughout the sciences to achieve various goals, including accelerating time-consuming computer simulations, making predictions in cases where the underlying physical behavior is poorly understood, deriving new insights into physical relationships, and supporting decisions. For example, in [pathak2020using], a low-fidelity simulation model is augmented with predictions made by a U-Net to obtain high-fidelity data at a fraction of the computational cost of running a high-fidelity simulation. In [Muller2020], DL models are used to make long-term predictions of groundwater levels in California in order to enable computationally fast predictions of future water availability. [10.1145/3437359.3465597] used convolutional neural networks (CNNs) for anomaly detection in scientific workflows. DL models can also be used to co-optimize hardware acquisition parameters with image reconstruction in computational imaging, as in [cheng_illumination_2019, robey_optimal_2018]. [LAPOINTE2020133] used neural networks to predict the CPU times of ordinary differential equation (ODE) solvers, thus enabling the optimal selection of ODE solvers.

One obstacle to using DL models in new scientific applications is the decision about which hyperparameters (e.g., number of layers, nodes per layer, batch size) should be used. The hyperparameters impact the accuracy of the DL model’s predictions and should therefore be tuned. However, the predictive performance is subject to variability that arises due to the use of stochastic optimizers such as stochastic gradient descent (SGD) for training the models. This variability must be taken into account during hyperparameter optimization (HPO) in order to achieve models that produce accurate predictions reliably.

There are many libraries for hyperparameter tuning, each with different strengths, compatibilities, and weaknesses. Frequently used libraries include Ray Tune, which supports state-of-the-art algorithms such as population-based training [jaderberg2017population] and Bayesian optimization search (BayesOptSearch) [bayesopt, 10.5555/2999325.2999464]; Optuna, which provides easy parallelization [akiba2019optuna]; Hyperopt, which works in both serial and parallel modes and supports discrete, continuous, and conditional dimensions [10.5555/3042817.3042832]; and Polyaxon, which targets large-scale applications [parameterlib].

However, little work discusses how the prediction uncertainty of deep learning models can be reduced with these approaches. This paper describes a new software tool, HYPPO, for automated HPO that is based on ideas from surrogate modeling, uncertainty quantification, and bilevel black-box optimization. In contrast to other HPO methods such as DeepHyper [8638041] and MENNDL [10.1145/2834892.2834896], our HPO implementation directly takes the prediction variability into account to deliver highly accurate and reliable DL models. We demonstrate the approach on the CIFAR10 image classification problem, time series prediction and a computed tomography image reconstruction application. We take full advantage of high-performance computing (HPC) environments by using asynchronous nested parallelization techniques to efficiently sample the high-dimensional hyperparameter space.

Relevance, impact, and contribution to the literature:

  1. Automated and adaptive HPO for DL models that directly accounts for uncertainty in model predictions;

  2. Significant decrease in the number of evaluations necessary to identify the optimal region in the hyperparameter space;

  3. Nested parallelism significantly accelerates time-to-solution (by up to two orders of magnitude);

  4. Compatible with both TensorFlow and PyTorch;

  5. HYPPO can run on large compute clusters as well as laptops and is able to exploit both GPUs and CPUs;

  6. Provides reliable and robust models.

II Motivation

Much work on hyperparameter search extensively studies applications in CNNs for image recognition [xiao2020efficient] or general deep learning models [8638041]. Various hyperparameter search libraries such as RLlib [pmlr-v80-liang18b], Hyperband [10.5555/3122009.3242042], and DeepHyper [8638041], and even hand-tuning, help iterate through multiple settings repeatedly until fairly well-performing hyperparameters are found. However, even after tuning the hyperparameters, repeated trainings of a deep learning model can produce predictions with a wide range of values. Current hyperparameter tuning libraries often ignore this challenge of prediction variability. In this work, we specifically target this problem of uncertainty quantification by building uncertainty into the hyperparameter tuning process. The ability to quantify the prediction uncertainty of DL models has the advantage that we can obtain confidence bounds for the DL model's predictions, as illustrated in two examples in Figure 1. The additional information about prediction uncertainty can be very advantageous in various applications. For instance, instead of fully trusting a model architecture because of good predictive performance (which can be interpreted as a single realization of a stochastic process), UQ provides additional information on how reliably that architecture performs. If, for example, one were to predict future groundwater levels to enable sustainable groundwater management (see [Muller2020]), having an idea about the prediction variability enables managers to develop strategies that are based on best, worst, and average case performance information and thus lead to robust management decisions. Similarly, in classification tasks, scientists may not only be interested in the class with the highest expected probability, but also in information such as the second or third most probable class membership as well as the uncertainties of those memberships, which allows further insights into the significance of differences between class memberships. This additional information is particularly relevant in scientific applications where assigning the wrong class may have severe negative impacts. The main contributions of HYPPO include (1) an automated and adaptive search for hyperparameters that uses uncertainty quantification to evolve the best deep neural network solutions for both PyTorch and TensorFlow codes, and (2) a demonstration of the scalability of our solution using high-performance computing (HPC) to leverage multiple threads and parallel processing to improve the overall search time for hyperparameters.

Our experiments cover time series prediction, image classification, and image reconstruction, and show how surrogate modeling can help find optimal deep learning architectures that minimize the uncertainty in the solutions.

III Bi-Level Optimization Problem

Mathematically, we formulate the HPO problem as a bilevel bi-objective optimization problem:

    $\min_{\theta \in \Omega} \; \bigl\{\, \ell(\theta; w^{*}(\theta)),\; \sigma(\theta; w^{*}(\theta)) \,\bigr\}$  (1)
    s.t.  $w^{*}(\theta) \in \arg\min_{w} L_{\mathrm{train}}(w; \theta)$  (3)

Here, $\theta$ represents the set of hyperparameters to be tuned, $\Omega$ describes the space over which the parameters are tuned (in our case an integer lattice), and $\ell$ and $\sigma$ represent the loss and loss variability of the trained model, respectively. Ideally, we want to find an optimal $\theta$ such that both $\ell$ and $\sigma$ are minimized simultaneously. Note that most commonly only a single objective that represents a mean squared loss is minimized and the variability is not accounted for.

The values of $\ell$ and $\sigma$ for each hyperparameter set $\theta$, evaluated on the validation dataset, can be computed only after solving the lower level problem (Eq. 3), in which the weights and biases $w$ are obtained by training (e.g., with SGD). Solving this lower level problem is, depending on the size of the training dataset and the size of the architecture, computationally expensive. The loss variability arises because different solutions $w^{*}$ are obtained when the same architecture is trained multiple times (see Section IV, Feature 1, for details), and thus different values for the loss result at the upper level. Our goal is to find optimal hyperparameters that yield high predictive accuracy (low $\ell$ values) and low prediction variability (low $\sigma$ values).

In order to tackle this problem, we interpret the HPO problem as a computationally expensive black-box problem in which a black-box evaluation corresponds to a stochastic evaluation of $\ell$ with variability $\sigma$. Furthermore, to alleviate the complexities of solving a bi-objective optimization problem, we use a single-objective reformulation in which a weighted sum of $\ell$ and $\sigma$ is minimized and the weights reflect the respective importance of each term.

To solve our HPO problem, we use ideas from the surrogate model-based optimization algorithm presented in [Muller2020] and extend it to consider the model uncertainty. We make full use of HPC by exploiting asynchronous nested parallelism strategies, which significantly reduce the time to finding optimal architectures.

III-A Surrogate Model-based HPO

Surrogate models such as radial basis functions (RBFs) [Powell1992] and Gaussian process models (GPs) [Matheron1963] have been widely used in the derivative-free optimization literature. These models are used to map the hyperparameters $\theta$ to the model performance metric $\ell(\theta)$ (here the dependence of $\ell$ on $w$ is implied by interpreting the lower level problem as a black-box evaluation). The surrogate models are computationally cheap to build and evaluate and thus enable an efficient and effective exploration of the search space [Jones2001]. Adaptive optimization algorithms using global surrogate models usually consist of the following three steps:

  1. Create an initial experimental design and evaluate the costly objective function to obtain input-output data pairs.

  2. Build a surrogate model based on all data pairs.

  3. Solve a computationally cheap auxiliary optimization problem on the surrogate model to select new point(s) at which the expensive function is evaluated and go to step 2.

In order to apply these methods to HPO, special sampling strategies that respect the integer constraints of the hyperparameters are needed (see [Muller2020] for details). Moreover, the iterative sampling strategy must take into account the performance variability, which can be done by defining suitable model performance metrics.
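
To make these steps concrete, the following minimal Python sketch implements the loop above with SciPy's RBFInterpolator as a stand-in surrogate and a random integer-lattice candidate search; the train_and_evaluate callable and all parameter names are illustrative placeholders, not the HYPPO API.

    import numpy as np
    from scipy.interpolate import RBFInterpolator

    def surrogate_hpo(train_and_evaluate, bounds, n_init=10, n_iter=50, n_candidates=500, seed=0):
        """Generic adaptive surrogate-based HPO loop (illustrative sketch).

        train_and_evaluate: callable mapping an integer hyperparameter vector to a loss.
        bounds: list of (low, high) integer bounds, one entry per hyperparameter.
        """
        rng = np.random.default_rng(seed)
        lows = np.array([b[0] for b in bounds])
        highs = np.array([b[1] for b in bounds])

        # Step 1: initial experimental design on the integer lattice.
        X = rng.integers(lows, highs + 1, size=(n_init, len(bounds)))
        y = np.array([train_and_evaluate(x) for x in X])

        for _ in range(n_iter):
            # Step 2: build a surrogate model on all (hyperparameters, loss) pairs.
            surrogate = RBFInterpolator(X.astype(float), y, kernel="cubic", smoothing=1e-9)

            # Step 3: cheap auxiliary problem -- score random integer candidates
            # with the surrogate and pick the most promising one.
            cand = rng.integers(lows, highs + 1, size=(n_candidates, len(bounds)))
            x_new = cand[np.argmin(surrogate(cand.astype(float)))]

            # Expensive evaluation (model training) at the selected point, then repeat.
            y = np.append(y, train_and_evaluate(x_new))
            X = np.vstack([X, x_new])

        best = int(np.argmin(y))
        return X[best], y[best]

In HYPPO itself, the candidate selection additionally respects the performance variability and the integer sampling strategies of [Muller2020]; this sketch only illustrates the three-step structure.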

IV HYPPO Software: Design & Features

In this section, we provide details of specific features of our HYPPO software and implementation. We describe our approach to accounting for the prediction variability during HPO, the proposed surrogate modeling and sampling approaches, and the asynchronous nested parallelism that enables us to speed up the calculations significantly.

Feature 1: Uncertainty Quantification

HYPPO employs uncertainty quantification (UQ) in conjunction with surrogate model-based HPO. The prediction variability present in Eq. (3) is epistemic: the variability manifests itself because Eq. (3) is a non-convex problem that is solved inexactly by stochastic algorithms. Non-convexity limits knowledge of global optimality, while solving the lower level problem inexactly gives no guarantee that the weights $w$ used to evaluate the loss in Eq. (1) are a stationary point of Eq. (3).

The body of potential UQ approaches for bounding the performance variability of a fixed hyperparameter set $\theta$ is thinned significantly by the nature of the problem. Since evaluating the loss is an expensive black-box function, any UQ method that depends on heavy sampling quickly becomes computationally intractable. Furthermore, distributional information for the performance value of $\theta$ is unknowable beyond trivial cases because such knowledge is tied to the local optima of Eq. (3), which in general are neither computed during training nor enumerated. An additional blockage to distributional knowledge comes from the possibility that different global optima of Eq. (3) likely yield different loss function values in Eq. (1). Finally, flexible UQ approaches capable of handling various DL models are essential for the broad applicability of our software. Considering the inherent difficulties and desired properties delineated, the central framework we utilize for UQ is Monte Carlo (MC) dropout [Gal2016].

MC dropout was developed in the form presented by Gal and Ghahramani [Gal2016, Gal_PhD] as an alternative derivation of model averaging from [Srivastava2014]. The key idea behind MC dropout is to use standard dropout during testing. Dropout is a technique often applied to prevent a network from overfitting and thus improve its generalizability [Srivastava2014, DO_survey]. This is achieved by taking each node out of the network with some probability $p$ during training and then multiplying the network's weights by $1-p$ during test time. Equivalently, as done in PyTorch and TensorFlow dropout layers, nodes can be dropped with probability $p$ and the value of the remaining nodes scaled by $1/(1-p)$ during training, resulting in no scaling of the weights during testing. MC dropout applies dropout to the network at test time as well: when an input is evaluated by forward-propagation, each node is dropped with probability $p$ and the output of the remaining nodes is scaled by $1/(1-p)$. Therefore, forward-propagating the same input through the network multiple times will produce a different output on each pass, from which we can obtain a measure of variability for the output. Formulating this mathematically, let $\tilde{y}_d(x)$ be the output of a DL model for input $x$ on the $d$-th pass through the network using MC dropout. From $D$ total passes through the network, we can compute a sample mean for the output of input $x$ as,

    $\bar{y}(x) = \frac{1}{D} \sum_{d=1}^{D} \tilde{y}_d(x)$,  (4)

and a sample variance,

    $s^2(x) = \frac{1}{D-1} \sum_{d=1}^{D} \bigl( \tilde{y}_d(x) - \bar{y}(x) \bigr)^2$,  (5)

where the squared terms in Eq. (5) are squared element-wise, i.e., $\bigl[(\tilde{y}_d(x) - \bar{y}(x))^2\bigr]_i = \bigl([\tilde{y}_d(x)]_i - [\bar{y}(x)]_i\bigr)^2$ for all $i$, where $[\cdot]_i$ denotes the $i$-th element of a vector. A crucial observation is that MC dropout derives variability measures from repeated evaluations of a DL model rather than from a large number of repeated trainings of the same model or through the construction of another network. Thus, it satisfies the general restrictions imposed by the characteristics of our loss functions in Eq. (1).
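
As an illustration of this procedure (not the HYPPO implementation), the following PyTorch sketch keeps only the dropout layers active at test time and repeats the forward pass to obtain the sample mean and variance of Eqs. (4) and (5); the function name and the number of passes are arbitrary choices.

    import torch
    import torch.nn as nn

    def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 100):
        """Sample mean and variance of a model output under MC dropout (cf. Eqs. 4-5)."""
        model.eval()  # keep batch-norm and other layers in inference mode
        # Re-activate only the dropout layers so that nodes keep being dropped at test time.
        for module in model.modules():
            if isinstance(module, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
                module.train()

        with torch.no_grad():
            outputs = torch.stack([model(x) for _ in range(passes)])  # (passes, ...) stack

        mean = outputs.mean(dim=0)               # sample mean, Eq. (4)
        var = outputs.var(dim=0, unbiased=True)  # element-wise sample variance, Eq. (5)
        return mean, var

Because dropout stays active, each forward pass sees a different thinned network, and the spread of the stacked outputs provides the variability measure discussed above.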

(a) Time series prediction of temperature in Melbourne, Australia
(b) CIFAR10 image classification
Fig. 1: Application of Monte Carlo dropout to DL models. In Fig. 1(a), the yellow and green regions represent one and two standard deviations from the mean (red line). The cyan lines are the MC dropout predictions and the black dashed lines are the mean predictions from each of the 5 independent trial models. The DL model used "prediction-on-prediction", i.e., the model's predicted values were utilized in making future predictions. In Fig. 1(b), the confidence interval around the mean probability for each class is shown for a single input image. While the right class (number 8) is correctly identified, the uncertainty of the output probability provides important information with regard to the stability and accuracy of the trained model.

Our UQ approach applies MC dropout to a relatively small set of trained models with the same architecture and forms a weighted average. Combining MC dropout with repeated training of the same architecture balances the explanatory power of repetitive training with the computational frugality of MC dropout. Assume we have $K$ trained models with the same architecture, for each of which we generate $D$ dropout masks. We compute the expected output of a trained model with this architecture as,

    $\hat{\mu}(x) = \frac{1}{K} \sum_{k=1}^{K} \Bigl[ \omega_1\, y_k(x) + \frac{\omega_2}{D} \sum_{d=1}^{D} \tilde{y}_{k,d}(x) \Bigr]$,  (6)

where $y_k(x)$ is the output of the $k$-th trained model for input $x$ without dropout at test time, $\tilde{y}_{k,d}(x)$ is the output of the $d$-th dropout model for trained model $k$, and $\omega_1, \omega_2 \geq 0$, with $\omega_1 + \omega_2 = 1$. The variance of the expected output is computed as,

    $\hat{\sigma}^2(x) = \frac{1}{K} \sum_{k=1}^{K} \Bigl[ \omega_1 \bigl( y_k(x) - \hat{\mu}(x) \bigr)^2 + \frac{\omega_2}{D} \sum_{d=1}^{D} \bigl( \tilde{y}_{k,d}(x) - \hat{\mu}(x) \bigr)^2 \Bigr]$.  (7)

Note that Eqs. (6) and (7) are weighted averages of the computed means and variances of the trained models and the MC dropout models. These equations enable micro-predictions of the performance uncertainty of a model for a given input. The weighting parameters $\omega_1$ and $\omega_2$, as well as the number of iterations of MC dropout per trained model, are user-defined settings. An example of our modified MC dropout approach can be seen in Figure 1 for time series prediction of daily temperature in Melbourne, Australia, and for CIFAR10 image classification. Here, the default values of these settings were used. The robustness of the model predictions can be measured, for example, in the average width of the uncertainty bands associated with each day's temperature prediction and with each class membership probability, respectively.
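
A minimal NumPy sketch of this weighted combination, under the form of Eqs. (6) and (7) as reconstructed above (weights summing to one); the array layouts, the helper name, and the default weights are assumptions for illustration only.

    import numpy as np

    def weighted_uq(no_dropout_outputs, dropout_outputs, w1=0.5, w2=0.5):
        """Weighted mean/variance over K trained models and their MC dropout passes.

        no_dropout_outputs: array of shape (K, ...)    -- outputs without dropout at test time
        dropout_outputs:    array of shape (K, D, ...) -- MC dropout outputs for each model
        w1, w2: user-defined weights with w1 + w2 = 1 (illustrative defaults).
        """
        y = np.asarray(no_dropout_outputs, dtype=float)
        yt = np.asarray(dropout_outputs, dtype=float)

        # Expected output: weighted average of trained-model outputs and per-model
        # MC dropout means (cf. Eq. 6).
        mu = (w1 * y + w2 * yt.mean(axis=1)).mean(axis=0)

        # Variance of the expected output: weighted average of element-wise squared
        # deviations from mu over both sources of outputs (cf. Eq. 7).
        var = (w1 * (y - mu) ** 2 + w2 * ((yt - mu) ** 2).mean(axis=1)).mean(axis=0)
        return mu, var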

Eqs. (6) and (7) can then be used to obtain confidence intervals for the loss function value corresponding to a given hyperparameter set $\theta$. For example, if $\ell$ is the mean-squared loss function over a validation data set $\{(x_i, z_i)\}_{i=1}^{N}$, then we approximate its expectation over the trained models by

    $\frac{1}{N} \sum_{i=1}^{N} \bigl\| \hat{\mu}(x_i) - z_i \bigr\|^2$.

The intuition is that $\hat{\mu}(x)$ serves as a good estimate of the actual expectation of the output over all possible trained models with architecture $\theta$. Now, from computing the outputs of each of the $K$ trained models, $y_k(x_i)$, and their many dropout masks, $\tilde{y}_{k,d}(x_i)$, for all of the inputs $x_i$, we are able to compute the value of $\ell$ for each of the models utilized to compute $\hat{\mu}$. Taking the standard deviation of these outer loss values gives an estimate of the variability $\sigma(\theta)$ in the outer loss function for a model with architecture $\theta$. Letting the value of $\ell$ computed from $\hat{\mu}$ be the center and using the aforementioned standard deviation as a radius, we obtain a confidence interval for the outer objective.

Upon obtaining confidence intervals and output variances using Eq. (7) for a hyperparameter set $\theta$, we desire to utilize this additional uncertainty information in the HPO. Our software uses the uncertainty in different ways. The first option for integrating uncertainty is to build the surrogate models on the value of $\ell$ computed from the expected output $\hat{\mu}$ in Eq. (6). The RBF and GP based optimization approaches available in the software (see the next section for details) then run as expected, with confidence intervals being computed and tabulated for all new hyperparameter settings.

The second option is to use the confidence intervals to construct an ensemble of RBFs to perform the surrogate modeling; the RBF ensemble approach uses multiple RBFs that are generated from the confidence intervals by selecting uniformly at random from the extremes of these intervals, i.e., the lower bound, center, and upper bound. Each response surface provides an estimate of the value of $\ell$ at potential hyperparameter candidates (see Feature 2 for a description of their generation), and we compute a mean $\bar{\ell}_{\mathrm{ens}}$ and standard deviation $\sigma_{\mathrm{ens}}$ over all ensemble predictions at each candidate point. The selection of the best candidate point (and thus the new hyperparameters to be evaluated) is determined by which candidate point minimizes a weighted sum of its distance to previously evaluated points and,

    $\bar{\ell}_{\mathrm{ens}}(\theta) + \gamma\, \sigma_{\mathrm{ens}}(\theta)$,  (8)

where $\gamma \in \mathbb{R}$. Note that if $\gamma = 0$, then only the mean value of $\ell$ predicted by the ensemble is considered when determining the next hyperparameter setting. If $\gamma > 0$, a "pessimistic" approach is taken by penalizing candidate points with large prediction variability. On the other hand, by setting $\gamma < 0$, an "optimistic" stance is taken by preferring candidate points with relatively good mean performance and large standard deviation that may lead to significant improvements.
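
The candidate selection just described can be sketched as follows; the scoring combines the ensemble mean and standard deviation of Eq. (8) with a distance-to-evaluated-points criterion, and the helper name, the normalization, and the distance weight are illustrative assumptions rather than HYPPO's exact rule.

    import numpy as np

    def select_candidate(ensemble_preds, candidates, evaluated, gamma=0.0, dist_weight=0.5):
        """Pick the next hyperparameter set from an ensemble of RBF response surfaces.

        ensemble_preds: array (n_surfaces, n_candidates) of predicted losses.
        candidates:     array (n_candidates, dim) of integer hyperparameter sets.
        evaluated:      array (n_evaluated, dim) of previously evaluated sets.
        gamma: 0 -> mean only; > 0 pessimistic; < 0 optimistic (cf. Eq. 8).
        """
        value = ensemble_preds.mean(axis=0) + gamma * ensemble_preds.std(axis=0)  # Eq. (8)

        # Distance of each candidate to its nearest previously evaluated point;
        # larger distances are preferred, so the distance term enters with a minus sign.
        dists = np.linalg.norm(candidates[:, None, :] - evaluated[None, :, :], axis=-1).min(axis=1)

        def normalize(v):
            span = v.max() - v.min()
            return (v - v.min()) / span if span > 0 else np.zeros_like(v)

        score = (1.0 - dist_weight) * normalize(value) - dist_weight * normalize(dists)
        return candidates[int(np.argmin(score))]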

The first approach can also be adapted to incorporate the computed variances by appending them to the outer loss function $\ell$. In the HYPPO software, a setting is available to add a regularization term to the outer objective to create a regularized loss function $\hat{\ell}$,

    $\hat{\ell}(\theta) = \ell(\theta) + \lambda\, \sigma(\theta)$,  (9)

where $\lambda \geq 0$ is a user-defined weight. Thus, choosing $\lambda$ to be large relative to the first term places a substantial emphasis on minimizing the performance variability during the surrogate modeling. A user could also tailor $\lambda$ to magnify or dampen the variance to varying degrees, or even for particular ranges of inputs, by defining it in a piece-wise fashion. If this setting is applied, then the value of $\hat{\ell}$ is used by the surrogate models, while the confidence interval of $\ell$ is returned during the HPO as previously noted.

In Fig. 2, we show one typical result provided by HYPPO to the user: the distribution of trained models according to their computed loss and confidence limit. Along with the total number of trainable parameters for each model, which characterizes the complexity of the model architecture, this graph provides insightful information that can help users decide which model is most efficient and suitable for their problem. A combination of low uncertainty, low loss, and a small number of trainable parameters can be considered an optimal choice of model.

Fig. 2: Distributions of the loss, standard deviation, and total number of trainable parameters across 825 trained multi-layer perceptron (MLP) models with different hyperparameter sets. The models were used to do time series prediction of the daily temperature in Melbourne, Australia. While complex architectures, i.e., those with a high number of trainable parameters, are found to be clustered, low-complexity models can be identified in the low-loss, low-uncertainty region, making them suitable model solutions.

Feature 2: Surrogate Modeling

Our HPO software contains two different types of surrogate models, namely RBFs and GPs, and corresponding iterative sampling strategies. The HPO method is, however, general enough that it can easily be extended to other surrogate models. The RBF surrogate model is of the form,

    $s_{\mathrm{RBF}}(\theta) = \sum_{i=1}^{n} \beta_i\, \phi\bigl( \| \theta - \theta_i \|_2 \bigr) + q(\theta)$,  (10)

where $\theta_1, \ldots, \theta_n$ are the sets of hyperparameters for which the objective function has previously been evaluated, $\| \cdot \|_2$ denotes the Euclidean norm, $\phi$ is the radial kernel, and $q$ is a polynomial tail whose form depends on the choice of $\phi$. A system of equations (see Eq. 6 in [Muller2020]), dependent on the values of $\ell$ (or $\hat{\ell}$) at the $\theta_i$'s, is solved to compute the coefficients $\beta_i$ and the parameters of $q$. When using the RBF ensemble approach as described in the previous section, the various values sampled from the confidence intervals are used as the right-hand side of Eq. 6 in [Muller2020], thus generating multiple response surfaces.
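
For concreteness, a minimal NumPy sketch of fitting a cubic RBF with a linear polynomial tail by solving the standard interpolation system (in the spirit of Eq. 6 in [Muller2020]); this is an illustrative implementation under the assumption of distinct sample points, not the HYPPO code.

    import numpy as np

    def fit_cubic_rbf(thetas, losses):
        """Fit s(theta) = sum_i beta_i * ||theta - theta_i||^3 + linear tail (cf. Eq. 10)."""
        thetas = np.asarray(thetas, dtype=float)   # shape (n, dim), points assumed distinct
        losses = np.asarray(losses, dtype=float)   # shape (n,)
        n, dim = thetas.shape

        dist = np.linalg.norm(thetas[:, None, :] - thetas[None, :, :], axis=-1)
        Phi = dist ** 3                              # cubic kernel phi(r) = r^3
        P = np.hstack([thetas, np.ones((n, 1))])     # linear polynomial tail

        # Interpolation system: [[Phi, P], [P^T, 0]] [beta; c] = [losses; 0]
        A = np.block([[Phi, P], [P.T, np.zeros((dim + 1, dim + 1))]])
        rhs = np.concatenate([losses, np.zeros(dim + 1)])
        coeffs = np.linalg.solve(A, rhs)
        beta, c = coeffs[:n], coeffs[n:]

        def predict(x):
            x = np.atleast_2d(np.asarray(x, dtype=float))
            r = np.linalg.norm(x[:, None, :] - thetas[None, :, :], axis=-1)
            return r ** 3 @ beta + np.hstack([x, np.ones((len(x), 1))]) @ c

        return predict

In the ensemble variant, the same system is solved several times with different right-hand sides sampled from the confidence intervals, which yields one response surface per sampled set of loss values.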

When making iterative sampling decisions with the RBF model, we follow the ideas in [Regis2007b]. We generate a large set of candidate points by perturbing the best point found so far and by randomly selecting points from the search space, making sure we satisfy the integer constraints. For each candidate point, we use the RBF to predict its function value (e.g., loss) and we also compute the distance of each candidate point to all previously evaluated hyperparameters. A weighted sum of these two criteria is then used to select the best candidate point at which the next expensive evaluation (model training, MC dropout) is performed. The weights cycle through a predefined pattern to enable a repeated transition between local and global search.

The second type of surrogate model implemented in our software is GPs. GPs have the advantage that, in addition to providing a prediction of the function value, they also return an uncertainty estimate of the prediction. When using a GP, we treat the objective function as a realization of a stochastic process:

    $\ell(\theta) = \mu + Z(\theta)$,  (11)

where $\mu$ represents the mean of the stochastic process and $Z(\theta) \sim \mathcal{N}(0, s_Z^2)$. Therefore, the term $Z(\theta)$ represents a deviation from the mean $\mu$. It is assumed that there exists a correlation between the errors and that it depends on the distance between the hyperparameters. For an unsampled set of hyperparameters, the GP prediction represents the realization of a random variable with a mean and variance that can be computed as described in [Muller2020, Eqs. 7-13]. When using the GP surrogate model, we maximize the expected improvement [Jones1998] auxiliary function using a genetic algorithm that can handle the integer constraints. This sampling strategy uses the GP-predicted value of the loss and the corresponding GP-predicted uncertainty to balance a local and a global search for improved solutions in the hyperparameter space. The process of utilizing MC dropout with the GP remains unchanged compared to the implementation with RBFs. The only difference is that no ensemble approach, as detailed for the RBFs, is needed because the GP provides the prediction mean and standard deviation.
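
A short sketch of the expected-improvement acquisition used with the GP; here gp_mean and gp_std stand for the GP posterior mean and standard deviation at candidate points (computed in HYPPO following [Muller2020, Eqs. 7-13]), and the genetic-algorithm search over the integer lattice is omitted.

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(gp_mean, gp_std, best_loss):
        """Expected improvement (for minimization) at candidate points."""
        gp_mean = np.asarray(gp_mean, dtype=float)
        gp_std = np.asarray(gp_std, dtype=float)
        improvement = best_loss - gp_mean
        with np.errstate(divide="ignore", invalid="ignore"):
            z = improvement / gp_std
            ei = improvement * norm.cdf(z) + gp_std * norm.pdf(z)
        # Where the GP is certain (zero std), fall back to the plain improvement.
        return np.where(gp_std > 0, ei, np.maximum(improvement, 0.0))

Candidates with a low predicted loss or a large predicted uncertainty both receive a high expected improvement, which is what balances the local and global search described above.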

Fig. 3: Convergence plot showing the loss value and its 1-$\sigma$ deviation for all function evaluations. In purple, we show the sorted loss values for a random set of 825 hyperparameter samples generated using low-discrepancy sampling. From that set, 10 evaluations with high losses were selected (red points) and used as initial evaluations for surrogate modeling. In orange, we show the best loss as the surrogate modeling iterates; the number of evaluations required to find the optimal region in the hyperparameter space is decreased by an order of magnitude.

In Fig. 3, we demonstrate the effectiveness of surrogate modeling by comparing convergence speeds: the lowest-loss model is reached with an order of magnitude fewer evaluations than with random sampling.

The improvement in HPO performance of HYPPO over a different library, DeepHyper, is shown in Fig. 4. To do the comparison, we used the polynomial fit problem provided in DeepHyper's online documentation (https://deephyper.readthedocs.io/en/latest/tutorials/hps_dl_basic.html), and we extended the complexity of the problem by increasing the number of hyperparameters to be optimized to six: (1) number of nodes per layer, (2) number of layers, (3) dropout rate, (4) learning rate, (5) number of epochs, and (6) batch size. We applied HPO with both the HYPPO and DeepHyper libraries independently and evaluated the high-dimensional hyperparameter space with a total of 200 iterations for each method. Ten initial evaluations were used in HYPPO to build the initial surrogate model. Our analysis shows that HYPPO and DeepHyper both eventually find models with the same performance, but HYPPO finds better hyperparameters faster (in fewer iterations) than DeepHyper, adding further motivation for using HYPPO.

Fig. 4: Comparison of convergence plots between our HYPPO software and the DeepHyper HPO library. We used the polynomial fit problem provided in DeepHyper's online documentation, allowing 200 iterations. In this example, the $R^2$ value (also known as the coefficient of determination) is the metric being maximized.

Feature 3: Asynchronous Nested Parallelism

One of the challenges in carrying out HPO for machine learning applications is the computational requirement needed to train multiple sets of hyperparameters. In order to address this problem, the HYPPO software handles parallelization across multiple hyperparameter evaluations and enables distributed training for each evaluation. Here, we describe how the nested parallelism feature is being implemented and used to maximize efficiency across all allocated resources.

IV-1 Nested parallelism

Fig. 5: Diagram illustrating the nested parallelization process across 6 processors using 2 parallel steps and 3 processors per step. A single processor is used per SLURM task. If the user requests uncertainty quantification to be performed on the trained models, the outer loss will be provided along with its confidence interval. When surrogate modeling is performed, the algorithm will gather newly evaluated sample sets across all parallel steps and make a prediction for the next set of hyperparameters to evaluate.

In Fig. 5, we show how we set up the parallelization scheme in the software. The HYPPO software uses GNU parallel [tange_ole_2018_1146014] to do distributed training of multiple models in parallel. The program can automatically generate a SLURM script using the number of SLURM steps to be executed in parallel (that is, the number of individual srun instances) and the number of SLURM tasks to be executed in each step, as defined by the user in the input configuration file. Since a single processor (either GPU or CPU, the user decides) will be assigned for each task, the total number of processors used in a SLURM job is therefore equal to the number of steps times the number of tasks. For instance, proper SLURM directives for a setup with 2 srun instances running in parallel across 3 GPUs for each step will read as follows:

    #SBATCH --ntasks 6
    #SBATCH --gpus-per-task 1

In this case, --ntasks is equal to the total number of processors to be allocated for the job. Each of the 2 SLURM steps can then be executed in parallel using GNU parallel with the --jobs option set to 2. Finally, in order to prevent individual srun steps from being executed on the same GPUs, we use the --exclusive flag for the srun call, which ensures that separate processors are dedicated to each job step.

Once the program starts, workers are initialized according to the SLURM settings. If the machine learning library used is PyTorch, the program performs the initialization using the torch.distributed package. If the TensorFlow library is used, the Horovod package is used to initialize the workers. While the software is flexible enough to work with external training modules, the library used in the external package needs to be specified in the input configuration file so that the program can initialize the workers accordingly.

Using the SLURM environment variables, we can then loop over the randomly generated hyperparameter sets, ensuring that each step executes a different set of hyperparameters. This can be achieved using Python’s slicing feature where the subsequence is defined using the step ID and the total number of steps in the job. When multiple processors are requested for each evaluation, the loss for the outer objective function will always be calculated in the first processor (e.g., GPU0 if GPU processors are used) and the value will be recorded in its corresponding log file. In the meantime, the remaining processors will wait for the task to be completed by searching for the recorded value in the first processor’s log file.
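
The slicing can be sketched as follows, assuming the step index and step count are passed through environment variables; the variable names and the example hyperparameter sets are hypothetical and only illustrate the subsequence selection.

    import os

    # Hypothetical environment variables exported by the generated SLURM script;
    # the actual names used by HYPPO's script generator may differ.
    step_id = int(os.environ.get("HYPPO_STEP_ID", "0"))    # index of this srun step
    n_steps = int(os.environ.get("HYPPO_NUM_STEPS", "1"))  # total parallel steps in the job

    # Placeholder list of randomly generated hyperparameter sets.
    hyperparameter_sets = [
        {"layers": 2, "nodes": 64}, {"layers": 3, "nodes": 32},
        {"layers": 4, "nodes": 128}, {"layers": 2, "nodes": 256},
    ]

    # Python slicing: each SLURM step evaluates its own subsequence.
    for hp_set in hyperparameter_sets[step_id::n_steps]:
        print(f"step {step_id} evaluates {hp_set}")  # stand-in for training + loss logging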

IV-2 Data / Trial Parallelization

As quantifying the uncertainty becomes more accurate with an increasing number of trials (i.e., repeatedly training the same architecture), the HYPPO software offers the possibility, in a single SLURM step, to parallelize the trials instead of the data to be trained on. Thus, in the example in Figure 5, the SLURM job corresponds to performing HPO, SLURM steps 1 and 2 correspond to two different sets of hyperparameters to be evaluated, and SLURM tasks 1-3 correspond to performing three separate trainings of each hyperparameter set. For example, suppose for a specific hyperparameter set, the model is to be trained nine times. In that case, each GPU will execute three consecutive independent trainings, and slicing will be applied to the trial indices using the extracted MPI rank and size values to ensure that the total number of trials to be computed across all GPUs in each step is equal to the number of trials requested by the user. On the other hand, if the user prioritizes the parallelization over the data (train in parallel), then in the example in Figure 5, the SLURM job still corresponds to performing HPO, SLURM steps 1 and 2 correspond to two different sets of hyperparameters to be evaluated, and SLURM tasks 1-3 perform the parallel training using a third of the data per GPU. Again, if the goal was to train each hyperparameter set nine times, each GPU will execute all nine independent training trials sequentially.
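
Trial parallelization follows the same slicing pattern, this time over trial indices using the MPI rank and size; the values below are placeholders, and in practice the rank and size come from the distributed backend (torch.distributed or Horovod).

    # Illustrative values; in practice the rank and size are obtained from the
    # distributed backend rather than hard-coded.
    rank, size = 1, 3      # e.g., SLURM task 2 of 3 within one step
    n_trials = 9           # independent trainings requested for this hyperparameter set

    # Each GPU runs its own slice of the trial indices sequentially (here: trials 1, 4, 7).
    for trial in range(n_trials)[rank::size]:
        print(f"rank {rank} runs trial {trial}")  # stand-in for one full training run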

IV-3 Asynchronous update

Fig. 6: Illustration of asynchronous surrogate modeling. Each rectangle represents an evaluation, the top number corresponds to the evaluation index, and the bottom values are the indices of the evaluations that were used by the surrogate model to determine the current hyperparameter set. In this example, 16 initial evaluations were done prior to starting the surrogate modeling and asynchronous parallelism. The model was then fitted and predicted 4 new sample sets to evaluate (evaluations 17 through 20, shown in bold). Once the first evaluation was completed (evaluation 18 in this example), a surrogate model was refitted using the newly computed evaluation (in addition to the first 16 evaluations) and a new set of hyperparameters to evaluate was predicted (here, evaluation 21).

The surrogate model is updated every time a new input-output data pair is obtained during the optimization. This process is usually done sequentially (one evaluation per iteration and update of the model) or synchronously in parallel (multiple simultaneous evaluations, with the model updated only after all evaluations are completed). This may require many iterations before the surrogate model adequately represents the loss function over the entire hyperparameter space. Moreover, in HPO, each hyperparameter evaluation may require a different amount of time because architectures of different sizes have different numbers of weights and biases that the training process optimizes. Using the aforementioned nested parallelization feature, we can instead predict not one but multiple sets of samples to be evaluated asynchronously in parallel. The surrogate model is then updated each time a hyperparameter set has been evaluated.

Nevertheless, unlike the initial evaluations, which are performed independently of each other, the surrogate modeling process requires communication between the SLURM steps in order to retrieve newly evaluated hyperparameter sets from other srun instances and update the model. In Fig. 6, we illustrate how evaluations can complete at different times during the surrogate modeling process. After each completed evaluation, the HYPPO software reads through all the log files generated and continuously updated by each processor to search for newly computed sample sets. Once the search is complete, the surrogate model is updated and a new sample set is computed.
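
An illustrative sketch of the asynchronous update loop: the driver scans the shared log files for newly completed evaluations, refits the surrogate on everything gathered so far, and proposes one new hyperparameter set per completed evaluation. The file layout, record format, and helper callables are assumptions, not the HYPPO implementation.

    import glob
    import json
    import time

    def async_surrogate_loop(log_dir, propose_next, submit, budget):
        """Asynchronous surrogate-based HPO driver (illustrative sketch)."""
        seen = {}  # evaluation index -> (hyperparameters, loss)

        while len(seen) < budget:
            previously_known = len(seen)

            # Scan the log files that each processor keeps appending to.
            for path in glob.glob(f"{log_dir}/*.log"):
                with open(path) as f:
                    for line in f:
                        record = json.loads(line)  # e.g. {"index": 18, "theta": [...], "loss": 0.42}
                        seen[record["index"]] = (record["theta"], record["loss"])

            # For each newly completed evaluation, refit the surrogate on all data
            # gathered so far and immediately propose one new hyperparameter set.
            for _ in range(len(seen) - previously_known):
                thetas = [t for t, _ in seen.values()]
                losses = [l for _, l in seen.values()]
                submit(propose_next(thetas, losses))  # hand the new set to an idle srun step

            time.sleep(5)  # poll again after a short wait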

V Case Study: CT Image Reconstruction

This section describes a relevant scientific application from computed tomography (CT) and demonstrates the impact of our HPO process on the results. CT is a three-dimensional imaging technique that measures a series of two-dimensional projections of an object rotated on a fixed axis relative to the direction of an X-ray beam. The collection of projections from different angles at the same cross-sectional slice of the object is called a 2D sinogram, which is the input to a reconstruction algorithm. The final 3D reconstructed volume is called a tomogram, which is assembled from the independent reconstructions of each measured sinogram. Tomographic imaging is used in various fields, including biology [wise_micro-computed_2013], medicine [rubin_computed_2014], materials science [garcea_x-ray_2018], and geoscience [cnudde_high-resolution_2013], allowing for novel observations that enable enhanced structural analysis of subjects of interest.

High-quality tomographic reconstruction generally requires measuring projections at many angles. However, collecting sparsely sampled CT data (i.e., sparse distribution of angle measurements) has the benefit of exposing the subject to less radiation and improving temporal resolution. In sparse angle CT, where few angles are available in the measured sinograms, computing the reconstruction is a severely ill-posed inverse problem. When using standard algorithms, the reconstructions will be noisy and obstructed with streak-like artifacts. In recent years, a class of algorithms using deep learning has come into prominence [parkinson_machine_2017, pelt_improving_2018, ayyagari_image_2018, huang_investigations_2018, bazrafkan_deep_2019, fu_hierarchical_2019, he_radon_2019]. Such approaches show superior performance compared to standard algorithms by using a trained neural network to solve the inverse problem of sparse angle CT reconstruction.

Here, we use HPO to optimize a deep neural network architecture for performing sinogram inpainting, an approach that has found success in other works [lee_view-interpolation_2017, li_sinogram_2019]. The missing angles of a sparsely sampled sinogram are filled in by a trained neural network, after which the completed sinogram can be reconstructed using any standard algorithm.

V-A Data and Architecture

Fig. 7: Test data examples for the CT reconstruction problem.

We use a simulated dataset created with XDesign, a Python package for generating X-ray imaging phantoms [ching2017xdesign]. The dataset comprises images of circles of various sizes, split into training, validation, and test examples, emulating the different feature scales present in experimental data, as shown in Fig. 7. We use TomoPy [gursoy2014tomopy], a Python package for tomographic image reconstruction and analysis, to generate sinograms of these images. To emulate sparse angle CT, every other angle is removed from the sinogram and Poisson noise is added.

The neural network architecture used in this problem takes the form of a U-Net, a type of CNN that has shown success in biomedical image processing problems [ronneberger2015u]. The input of the U-Net is the sparse angle sinogram, and the desired output is the completed sinogram. Our variation of the U-Net architecture consists of an equivalent number of downsampling and upsampling blocks. Within each block, several intermediate convolution layers preserve the size of the input, and a final convolution layer increases the number of feature maps (channels). We selected eight hyperparameters for HPO, enumerated in Table I. HPO aims to minimize the mean squared error between the actual and predicted sinograms of the validation dataset.

Parameters   (a)      (b)      (c)      (d)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
MSE
PSNR         35.2     34.6     18.7     16.4
SSIM         0.965    0.988    0.970    0.955
TABLE I:

A total of eight hyperparameters were selected for hyperparameter optimization: (1) number of output feature maps of the initial block, (2) multiplier for the number of feature maps in each subsequent block, (3) number of blocks, (4) intermediate layers, (5) convolutional kernel size and (6) stride of the final layer in each block, (7) dropout probability and (8) intermediate layer kernel size. The first eight rows show the value of each hyperparameter and the last four rows contain the training results and image metric quantities averaged over the test dataset. Columns (a) and (d) are the results of the network trained with the minimum and maximum values for each hyperparameter, respectively. Columns (b) and (c) are the results of training the neural network with the best and worst hyperparameter values sampled by HYPPO, respectively.

The resulting sinogram from the trained U-Net is reconstructed by TomoPy using the Simultaneous Iterative Reconstruction Technique (SIRT) [GILBERT1972105], which has shown the ability to produce accurate reconstructions given incomplete datasets. SIRT iteratively minimizes the error between the measured projections and the projections calculated from the reconstruction at the current iteration $k$. The average error is then backprojected to refine the reconstruction at each iteration. For an inverse problem $Ax = b$, the SIRT update equation is:

    $x^{(k+1)} = x^{(k)} + C A^{T} R \bigl( b - A x^{(k)} \bigr)$,

where $C$ and $R$ are diagonal matrices that contain the inverse of the sum of the columns and rows, respectively, of $A$.
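
A compact NumPy sketch of this update for a dense system matrix A (the actual reconstructions in this work use TomoPy's optimized SIRT implementation); the small constants only guard against division by zero and are not part of the algorithm.

    import numpy as np

    def sirt(A, b, n_iters=100, x0=None):
        """Simultaneous Iterative Reconstruction Technique for A x = b (dense sketch)."""
        A = np.asarray(A, dtype=float)
        b = np.asarray(b, dtype=float)

        # Diagonal scalings: inverse column sums (C) and inverse row sums (R) of A.
        C = 1.0 / np.maximum(A.sum(axis=0), 1e-12)
        R = 1.0 / np.maximum(A.sum(axis=1), 1e-12)

        x = np.zeros(A.shape[1]) if x0 is None else np.asarray(x0, dtype=float)
        for _ in range(n_iters):
            residual = b - A @ x                # measured minus computed projections
            x = x + C * (A.T @ (R * residual))  # x^(k+1) = x^(k) + C A^T R (b - A x^(k))
        return x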

V-B Results

Fig. 8: Job speedup as a function of the number of SLURM steps and number of SLURM tasks is shown for evaluating 50 different hyperparameter sets and five trials for each. Up to 96 GPU processors from the NERSC/Cori cluster were used to evaluate performance and scalability of our software. Lower speedup values correspond to better performance.

We analyze two aspects of applying HPO to the case study of CT image reconstruction: 1) job speedup as a function of the number of SLURM steps and SLURM tasks for the specific configuration of running the evaluation operation (see Sec. IV-2) and 2) improvement due to optimizing the hyperparameters. HYPPO is executed on the Cori supercomputer, housed at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. Cori is a Cray XC40 system comprised of 2,388 nodes each containing two 2.3 GHz 16-core Intel Haswell processors and 128 GB DDR4 2133 MHz memory, and 9,688 nodes each containing a single 68-core 1.4 GHz Intel Xeon Phi 7250 (Knights Landing) processor and 96 GB DDR4 2400 MHz memory. Cori also provides 18 GPU nodes, where each node contains two sockets of 20-core Intel Xeon Gold 6148 2.40 GHz, 384 GB DDR4 memory, and 8 NVIDIA V100 GPUs (each with 16 GB HBM2 memory). We use the Haswell processors and the GPU nodes for the results presented in this section.

Fig. 9: Scatter plot of median loss value versus median absolute deviation. The loss and median absolute deviation are computed over 50 training trials for each of 50 hyperparameter evaluations. In the bottom left region, a simple architecture (fewer than 50k trainable parameters) can be found that provides an accurate (low loss) and reliable (low variability) solution.
Fig. 10: Comparison of the complete sinogram with the sparse and inpainted (neural network output) sinograms for a test example. The reconstruction for each sinogram is performed using the SIRT algorithm, described in Section V-A. Error metrics of per-pixel mean squared error (MSE, lower is better), structural similarity (SSIM, larger is better), and peak signal-to-noise ratio (PSNR, larger is better) are computed relative to the reference reconstruction, showing the improvement of using the inpainted sinogram with optimized network hyperparameters over the sparse sinogram and the inpainted sinogram with sub-optimal network hyperparameters.

The first experiment consists of running 50 different hyperparameter evaluations with five trials for each evaluation. Fig. 8 reports the maximum speedup resulting from this experiment across each SLURM step and SLURM task. The neural network is trained using the entire training dataset. We observe an improvement of two orders of magnitude in speedup between the combination of 1 SLURM task/1 SLURM step and 6 SLURM tasks/16 SLURM steps.

Fig. 9 shows the outer loss function value plotted against the median absolute deviation. Here, we evaluated each hyperparameter set 50 times to compute the median outer loss function value and the median absolute deviation. An outer loss value of 24.81 is achieved within four iterations when performing Gaussian process surrogate modeling, which may not be guaranteed by random sampling alone, emphasizing the importance of HPO in this case study.

Fig. 11: The absolute error maps for each example in Fig. 10. The baseline comparison for each error map is the complete sinogram.

We also assess the performance of the U-Net after running HPO, comparing the quality of the reconstruction obtained with the neural network output to that obtained with the original (complete) sinogram. The optimal network hyperparameters found with HPO show improved reconstructions compared to other hyperparameters sampled during HPO (see Figs. 10 and 11, and Table I), demonstrating the importance of HPO to this case study. We also show the results of training the network with values at the lower and upper bounds (set by the user) of all eight hyperparameters, which define the hyperparameter search space. For the results in Figs. 10 and 11, and Table I, each neural network architecture was trained for the same number of iterations. We use the structural similarity index measure (SSIM) to quantify how similar the sparse and inpainted sinogram reconstructions are to the complete sinogram reconstruction. SSIM values range between -1 and 1, with values closer to 1 indicating very similar structures between the two images being compared. SSIM, unlike MSE and PSNR, does not measure absolute error, allowing structural differences to be characterized.

VI Discussions

In the previous sections, we demonstrated different features of our HYPPO software for finding optimal hyperparameters of neural networks. Although our results are promising on different applications, two aspects of our HYPPO method require further development in order to improve its efficiency. First, a sensitivity analysis (SA) of the model's performance with respect to the hyperparameters is needed. If we could identify the subset of hyperparameters that most impact the model's performance, we could significantly reduce the number of hyperparameter sets that need to be tried during the optimization because the search space would be smaller. Thus, additional savings in optimization time could be achieved. Second, our current initial experimental design is created by randomly sampling integer values from the hyperparameter space. Using a type of space-filling design (e.g., a low-discrepancy sequence) instead would be preferable, and thus our initial surrogate models could be improved. However, there are obstacles for both space-filling designs and sensitivity analysis when parameters have integer constraints, as in our problem. Off-the-shelf SA methods such as the ones implemented in the SALib open-source library in Python [Herman2017] only work for continuous parameters. Low-discrepancy sampling methods such as Sobol' sequences [sobol2001] generate evenly distributed points across the sample space, avoiding large clusters and gaps between the points. However, these methods are not easily modified for integer constraints and must be developed further. Simply rounding or truncating continuous values to obtain integers does not deliver the sample characteristics required for SA and sample designs to be maximally effective. However, it is possible to formulate and solve an integer optimization problem to achieve the desired sample distribution. There is also an opportunity to modify the computation of the sample points in Sobol' sequences. These aspects will be future features in our software.

Another aspect of our HYPPO software that requires further analysis is its algorithm parameters, including the size of the initial experimental design, the weights used to balance the importance of expected performance and uncertainty, the number of times a given hyperparameter set should be trained, and how many hyperparameter sets should be tried in total. The weights are user-defined and should reflect the user's averseness to model variability. The number of hyperparameter sets to be tried in total usually depends on the available compute budget. The size of the initial experimental design influences how well the initial surrogate model approximates the objective function. Typically, a larger initial experimental design gives better surrogate model performance at first, but it also means that fewer adaptively selected hyperparameters can be evaluated within the compute budget. Another potentially impactful parameter is the dropout rate. Currently, a default value is used, but in the future we will include it among the hyperparameters to be optimized. Lastly, in this study we did not take noise in the data into account. In the future, we will analyze how small variations in the training data propagate through the network and impact the predictive performance and reliability of the DL models.

VII Conclusions

In this paper, we demonstrated a new method for conducting hyperparameter optimization of deep neural networks while taking into account the prediction uncertainty that arises from using stochastic optimizers for training the models. Computationally cheap surrogate models are exploited to reduce the number of expensive model trainings that have to be done. We showed the quality of the solutions found on a variety of datasets and how asynchronous nested parallelism can be exploited to significantly accelerate the time-to-solution. The HYPPO software comes with a number of features, allowing the user to conduct simple HPO, UQ, and HPO under uncertainty. Model variability that arises from using stochastic optimizers when training models is rarely addressed in the ML literature, but it can have a considerable effect, particularly in scientific applications for which dataset sizes are often limited. High variability of the model performance can have a significant impact on decisions being made, and these decisions should be made with confidence by using reliable models. Our software is a first step towards providing these much needed reliable and robust models. Finally, HYPPO was developed as open-source software and will be made publicly available in the future via the pip Python package manager. More information about the software, including detailed documentation, can be found at https://hpo-uq.gitlab.io/.

Reproducibility

As we strive to make this research fully transparent, step-by-step instructions are provided in the HYPPO online documentation (see link in the previous section) to allow complete reproduction of this work.

Acknowledgement

Vincent Dumont, Mariam Kiran, and Juliane Mueller are supported by the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory and the Office of Advanced Scientific Computing Research under U.S. Department of Energy Contract No. DE-AC02-05CH11231. Vidya Ganapati and Talita Perciano were supported by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Visiting Faculty Program (VFP). Casey was supported by National Science Foundation (NSF) through the Mathematical Sciences Graduate Internship (MSGI) program. Anuradha Trivedi was supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the Science Undergraduate Laboratory Internship (SULI) program, and in part by the Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, through the grant “Scalable Data-Computing Convergence and Scientific Knowledge Discovery”, which also partially supported Talita Perciano.

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231.

References